On 8/9/16 4:15 AM, Carlos Maiolino wrote:
> Document the implementation of error handlers into sysfs.
>
> Changelog:
>
> V2:
>   - Add a description of the precedence order of each option, focusing on
>     the behavior of "fail_at_unmount" which was not well explained in V1

Global: s/shutdown/shut down/g

> Signed-off-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> ---
>  Documentation/filesystems/xfs.txt | 94 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>
> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> index 8146e9f..d483e0b 100644
> --- a/Documentation/filesystems/xfs.txt
> +++ b/Documentation/filesystems/xfs.txt
> @@ -348,3 +348,97 @@ Removed Sysctls
>    ----                          -------
>    fs.xfs.xfsbufd_centisec       v4.0
>    fs.xfs.age_buffer_centisecs   v4.0
> +
> +Error handling
> +==============
> +
> +XFS can act differently according with the type of error found

"according to"

> +during its operation. The implementation introduces the following
> +concepts to the error handler:
> +
> + -failure speed:
> +    Defines how fast XFS should shutdown in case of a specific
> +    error is found during the filesystem operation. It can
> +    shutdown immediately, after a defined number of tries, or
> +    simply try forever, which was the old behavior and is now
> +    set as default behavior, except during unmount time, where
> +    in case of a error is found while unmounting, the filesystem
> +    will shutdown.

    Defines how fast XFS should shut down when a specific error is
    found during filesystem operation. It can shut down immediately,
    after a defined number of retries or after a set time period,
    or simply retry forever. The old "retry forever" behavior is
    still the default, except during unmount, where any IOs retrying
    due to errors will be cancelled and unmount will be allowed to
    proceed.

> +
> + -error classes:
> +    Specifies the subsystem/location where the error handlers
> +    configure the behavior for, such as metadata or memory allocation.

    Specifies the subsystem/location for the error handlers, such as
    metadata or memory allocation. Only metadata IO errors are handled
    at this time.

> +
> + -error handlers:
> +    Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files, where, the

s/where,/where/

> +errors are organized with the following structure:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/

where <dev> is the device name for the mounted filesystem.

> +
> +Each directory contains:
> +
> +  /sys/fs/xfs/<dev>/error/
> +
> +    fail_at_unmount (Min: 0 Default: 1 Max: 1)
> +      Defines the global error behavior during unmount time. If set to
> +      "1", XFS will shutdown in case of any error is found, otherwise,
> +      if set to "0", the filesystem will indefinitely retry to cleanly
> +      unmount the filesystem.

      Defines the global error behavior at unmount time. If set to the
      default value of 1, XFS will cancel any pending IO retries, shut
      down, and unmount. If set to 0, pending IO retries may prevent
      the filesystem from unmounting.

> +
> +    <class> subdirectories
> +      Contains specific error handlers configuration
> +      (Ex: /sys/fs/xfs/<dev>/error/metadata).

+      (Ex: /sys/fs/xfs/<dev>/error/metadata, see below)

> +
> +  /sys/fs/xfs/<dev>/error/<class>/
> +    Directory containing configuration for a specific error <class>; currently only the "metadata" <class> is implemented.
> +    The contents of this directory are <class> specific, since each <class>
> +    might need to handle different types of errors.

Remove this part:

> +    All <error> directory
> +    though, contains the "default" directory, which is a global configuration
> +    for errors not available for independent configuration.

because there is no "default" under the <error> - and you haven't
described <error> yet, anyway... "default" is covered properly below.

> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>

/sys/fs/xfs/<dev>/error/<class>/<error>/

to indicate that it is a directory, not a file.

> +
> +    Contains the failure speed configuration files for each specific error,
> +    including the "default" behavior, which contains the same configuration
> +    options as the specific errors.
> +
> +    The available configurations for each error type are:

    Contains the failure speed configuration files for specific errors
    in this <class>, as well as a "default" behavior. Each <error>
    directory contains the following configuration files; the first
    condition met for a specific configuration will cause the
    filesystem to shut down:

> +
> +    max_retries (Min: -1 Default: -1 Max: INTMAX)
> +      Define how many tries the filesystem is allowed to retry its
> +      operations during the specific error, before shutdown the
> +      filesystem. Setting this file to "-1", will set XFS to retry
> +      forever in the specific error, setting it to "0", will make
> +      XFS to fail immediately after the specific error is found,
> +      while setting it to a "N" value, where N is greater than 0,
> +      will make XFS retry "N" times before shutdown.

      Defines the allowed number of retries of a specific error before
      the filesystem will shut down. The default value of "-1" will
      cause XFS to retry forever for this specific error. Setting it
      to "0" will cause XFS to fail immediately when the specific
      error is found, and setting it to "N," where N is greater than 0,
      will make XFS retry "N" times before shutting down.

> +
> +    retry_timeout_seconds (Min: 0 Default: 0 Max: INTMAX)
> +      Define the amount of time (in seconds) that the filesystem is
> +      allowed to retry its operations when the specific error is
> +      found. "0" means no wait time.

does "no wait time" mean shut down immediately, or ignore the wait
time altogether? I think it is the latter, so:

      Define the amount of time (in seconds) that the filesystem is
      allowed to retry its operations when this specific error is
      found. The default value of "0" will cause XFS to retry forever.

Right?

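For illustration only, here is a minimal Python sketch of how the sysfs layout described above maps to file paths. The helper names, the `root` parameter, and the example device name are assumptions made for this sketch and are not part of the patch; only the `/sys/fs/xfs/<dev>/error/<class>/<error>/` structure, the "metadata" class, the "default" error directory, and the two file names come from the text under review.

```python
from pathlib import Path


def error_config_dir(dev, error_class="metadata", error="default",
                     root="/sys/fs/xfs"):
    """Build /sys/fs/xfs/<dev>/error/<class>/<error>/ for a mounted device.

    `root` is a hypothetical parameter so the helper can be pointed at a
    scratch directory instead of the real sysfs tree.
    """
    return Path(root) / dev / "error" / error_class / error


def read_setting(dev, name, **kwargs):
    """Read one integer setting, e.g. "max_retries"."""
    return int((error_config_dir(dev, **kwargs) / name).read_text().strip())


def write_setting(dev, name, value, **kwargs):
    """Write one integer setting, e.g. "retry_timeout_seconds"."""
    (error_config_dir(dev, **kwargs) / name).write_text(f"{value}\n")
```

With a filesystem mounted from a device called, say, `sda1` (a made-up name), `error_config_dir("sda1")` would give `/sys/fs/xfs/sda1/error/metadata/default`. Writing to the real files requires root privileges.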
As an aside, I'm not sure why we don't have the consistency of "-1"
meaning "ignore this setting" as we do for max_retries. Dave, think
it's too late to change that?

> +
> +
> +
> +    Order of precedence:
> +      "max_retries" takes precedence over "retry_timeout_seconds",
> +      where, "retry_timeout_seconds" will only be tested if
> +      "max_retries" limit was not reached yet or is set to retry
> +      forever ("-1"). If "max_retries" limit is reached, the
> +      filesystem will shutdown, wether or not "retry_timeout_seconds"
> +      has been reached.

This doesn't seem quite right. Isn't it simply the *first* threshold
to be reached, whether it is max_retries or retry_timeout_seconds?
So that's not really an order of precedence. I think:

I think that all of this can be simply moved up to the descriptions
above.

> +      "fail_at_unmount" on the other hand, works independently of the
> +      remainder options. It will only be tested during unmount time,
> +      but, it will shutdown the filesystem independent of the limits
> +      set into "max_retries" or "retry_timeout_seconds".
> +      It has been added because sysfs configuration can't be changed
> +      after an unmount is triggered, once the sysfs directory from
> +      the filesystem being unmounted will be detached from the sysfs
> +      tree, so, even if the sysadmin wants to make XFS retry forever
> +      for any error during the filesystem operation, the filesystem
> +      can still be properly unmounted if any error was detected and
> +      "fail_at_unmount" is set. Otherwise, the umount process get
> +      stuck forever.
>

I'd leave this all out.
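To make the "first threshold reached" reading concrete, here is a small Python model of the decision as I understand it from the descriptions above. It is a sketch of the documented semantics, not the kernel implementation; the function name and its arguments are hypothetical, and treating a `retry_timeout_seconds` of 0 as "no time limit" is the interpretation questioned earlier in this review, not a confirmed fact.

```python
RETRY_FOREVER = -1  # documented default for max_retries
NO_TIMEOUT = 0      # default for retry_timeout_seconds (assumed: no time limit)


def should_shut_down(retries_done, seconds_elapsed,
                     max_retries=RETRY_FOREVER,
                     retry_timeout_seconds=NO_TIMEOUT,
                     unmounting=False, fail_at_unmount=True):
    """Return True if the modeled filesystem should stop retrying.

    Whichever limit is hit first wins; neither limit takes precedence.
    """
    # At unmount time, fail_at_unmount overrides both retry limits.
    if unmounting and fail_at_unmount:
        return True
    # max_retries: 0 fails immediately, N > 0 allows N retries, -1 never trips.
    if max_retries != RETRY_FOREVER and retries_done >= max_retries:
        return True
    # retry_timeout_seconds: 0 is modeled as "ignore this setting".
    if (retry_timeout_seconds != NO_TIMEOUT
            and seconds_elapsed >= retry_timeout_seconds):
        return True
    return False
```

With both defaults the model retries forever during normal operation, and with both limits set it shuts down as soon as either one is reached, which is why "order of precedence" seems like the wrong framing.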