Re: [PATCH] xfs: Document error handlers behavior [V2]

Eric Sandeen <sandeen@xxxxxxxxxxx> · Wed, 31 Aug 2016 09:18:25 -0500

On 8/9/16 4:15 AM, Carlos Maiolino wrote:
> Document the implementation of error handlers into sysfs.
> 
> Changelog:
> 
> V2:
> 	- Add a description of the precedence order of each option, focusing on
> 	  the behavior of "fail_at_unmount" which was not well explained in V1

Global: s/shutdown/shut down/g

> Signed-off-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> ---
>  Documentation/filesystems/xfs.txt | 94 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
> 
> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> index 8146e9f..d483e0b 100644
> --- a/Documentation/filesystems/xfs.txt
> +++ b/Documentation/filesystems/xfs.txt
> @@ -348,3 +348,97 @@ Removed Sysctls
>    ----				-------
>    fs.xfs.xfsbufd_centisec	v4.0
>    fs.xfs.age_buffer_centisecs	v4.0
> +
> +Error handling
> +==============
> +
> +XFS can act differently according with the type of error found

"according to"

> +during its operation. The implementation introduces the following
> +concepts to the error handler:
> +
> + -failure speed:
> +	Defines how fast XFS should shutdown in case of a specific
> +	error is found during the filesystem  operation. It can
> +	shutdown immediately, after a defined number of tries, or
> +	simply try forever, which was the old behavior and is now
> +	set as default behavior, except during unmount time, where
> +	in case of a error is found while unmounting, the filesystem
> +	will shutdown.

Defines how fast XFS should shut down when a specific error is found
during filesystem operation.  It can shut down immediately, after a
defined number of retries or after a set time period, or simply retry
forever.  The old "retry forever" behavior is still the default, except
during unmount, where any IOs retrying due to errors will be cancelled
and unmount will be allowed to proceed.

> +
> + -error classes:
> +	Specifies the subsystem/location where the error handlers
> +	configure the behavior for, such as metadata or memory allocation.

Specifies the subsystem/location for the error handlers, such as
metadata or memory allocation.  Only metadata IO errors are handled 
at this time.
> +
> + -error handlers:
> +	Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files, where, the

s/where,/where/

> +errors are organized with the following structure:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/

where <dev> is the device name for the mounted filesystem.
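
As a rough illustration of that layout (not part of the patch), the directory
for a given error can be built as below.  The helper name and the "sda1"
device name are hypothetical; "metadata" and "default" are the class and
error names mentioned in this thread.

```python
from pathlib import PurePosixPath


def error_config_dir(dev, error_class="metadata", error="default"):
    """Build /sys/fs/xfs/<dev>/error/<class>/<error> for a mounted filesystem.

    `dev` is the device name of the mounted filesystem; the defaults pick the
    "metadata" class and the "default" error directory.
    """
    return PurePosixPath("/sys/fs/xfs") / dev / "error" / error_class / error


# "sda1" is only an example device name.
print(error_config_dir("sda1"))
```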

> +
> +Each directory contains:
> +
> + /sys/fs/xfs/<dev>/error/
> +
> +	fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> +		Defines the global error behavior during unmount time. If set to
> +		"1", XFS will shutdown in case of any error is found, otherwise,
> +		if set to "0", the filesystem will indefinitely retry to cleanly
> +		unmount the filesystem.

Defines the global error behavior at unmount time.  If set to the default value of
1, XFS will cancel any pending IO retries, shut down, and unmount.  If set to
0, pending IO retries may prevent the filesystem from unmounting.
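
A minimal sketch of toggling this from a script, assuming the sysfs layout
described in the patch.  The function name, the "sda1" device name, and the
sysfs_root parameter are hypothetical; sysfs_root exists only so the helper
can be tried against a scratch directory, since writing to the real file
normally requires root.

```python
from pathlib import Path


def set_fail_at_unmount(dev, enabled, sysfs_root="/sys/fs/xfs"):
    """Write 1 or 0 to <sysfs_root>/<dev>/error/fail_at_unmount.

    1 (the default) lets unmount cancel pending IO retries and proceed;
    0 means pending retries may keep the filesystem from unmounting.
    """
    path = Path(sysfs_root) / dev / "error" / "fail_at_unmount"
    path.write_text("1" if enabled else "0")
    return path


# Example (hypothetical device name):
#   set_fail_at_unmount("sda1", True)
```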

> +
> +	<class> subdirectories
> +		Contains specific error handlers configuration
> +		(Ex: /sys/fs/xfs/<dev>/error/metadata).

+		(Ex: /sys/fs/xfs/<dev>/error/metadata, see below)

> +
> + /sys/fs/xfs/<dev>/error/<class>/
> +

	Directory containing configuration for a specific error <class>;
	currently only the "metadata" <class> is implemented.

> +	The contents of this directory are <class> specific, since each <class>
> +	might need to handle different types of errors.

Remove this part:

> +     All <error> directory
> +	though, contains the "default" directory, which is a global configuration
> +	for errors not available for independent configuration.

because there is no "default" under the <error> - and you haven't described
<error> yet, anyway... "default" is covered properly below.

> +
> + /sys/fs/xfs/<dev>/error/<class>/<error>

/sys/fs/xfs/<dev>/error/<class>/<error>/

to indicate that it is a directory, not a file.

> +
> +	Contains the failure speed configuration files for each specific error,
> +	including the "default" behavior, which contains the same configuration
> +	options as the specific errors.
> +
> +	The available configurations for each error type are:

Contains the failure speed configuration files for specific errors in this
<class>, as well as a "default" behavior.  Each <error> directory contains
the following configuration files; the first condition met for a specific
configuration will cause the filesystem to shut down:

> +
> +	max_retries			(Min: -1  Default: -1  Max: INTMAX)
> +		Define how many tries the filesystem is allowed to retry its
> +		operations during the specific error, before shutdown the
> +		filesystem. Setting this file to "-1", will set XFS to retry
> +		forever in the specific error, setting it to "0", will make
> +		XFS to fail immediately after the specific error is found,
> +		while setting it to a "N" value, where N is greater than 0,
> +		will make XFS retry "N" times before shutdown.

Defines the allowed number of retries of a specific error before the filesystem
will shut down.  The default value of "-1" will cause XFS to retry forever
for this specific error.  Setting it to "0" will cause XFS to fail immediately
when the specific error is found, and setting it to "N", where N is greater
than 0, will make XFS retry "N" times before shutting down.

> +
> +	retry_timeout_seconds		(Min:  0  Default:  0  Max: INTMAX)
> +		Define the amount of time (in seconds) that the filesystem is
> +		allowed to retry its operations when the specific error is
> +		found. "0" means no wait time.

Does "no wait time" mean shut down immediately, or ignore the wait time
altogether?  I think it is the latter, so:

Defines the amount of time (in seconds) that the filesystem is allowed to
retry its operations when this specific error is found.  The default
value of "0" will cause XFS to retry forever.

Right?

As an aside, I'm not sure why we don't have the consistency of "-1"
meaning "ignore this setting" as we do for max_retries.  Dave, think it's
too late to change that?

> +
> +
> +
> + Order of precedence:
> +		"max_retries" takes precedence over "retry_timeout_seconds",
> +		where, "retry_timeout_seconds" will only be tested if
> +		"max_retries" limit was not reached yet or is set to retry
> +		forever ("-1"). If "max_retries" limit is reached, the
> +		filesystem will shutdown, wether or not "retry_timeout_seconds"
> +		has been reached.

This doesn't seem quite right.  Isn't it simply the *first* threshold to be
reached, whether it is max_retries or retry_timeout_seconds?  So that's not
really an order of precedence.

I think that all of this can simply be moved up to the descriptions above.
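
To make the "first threshold reached" reading concrete, here is a small
model of it.  This is only a sketch of the semantics as described in this
review, not the kernel code: the function and constant names are made up,
the exact comparisons (>=) are an assumption, and it takes "0" for
retry_timeout_seconds to mean "no time limit", which is the interpretation
still being questioned above.

```python
RETRY_FOREVER = -1  # max_retries value meaning "no retry limit"
NO_TIMEOUT = 0      # retry_timeout_seconds value assumed to mean "no time limit"


def should_shut_down(retries_so_far, seconds_elapsed,
                     max_retries=RETRY_FOREVER,
                     retry_timeout_seconds=NO_TIMEOUT):
    """Return True once either configured threshold has been reached.

    Neither setting takes precedence: whichever limit is hit first
    triggers the shutdown, and a disabled limit is never hit.
    """
    retries_exhausted = (max_retries != RETRY_FOREVER
                         and retries_so_far >= max_retries)
    timed_out = (retry_timeout_seconds != NO_TIMEOUT
                 and seconds_elapsed >= retry_timeout_seconds)
    return retries_exhausted or timed_out


# Defaults: retry forever, regardless of how long it has been.
print(should_shut_down(1000, 3600.0))
# max_retries=0: fail on the first occurrence of the error.
print(should_shut_down(0, 0.0, max_retries=0))
```

With both limits set, the model shuts down on whichever is reached first,
which is why describing each file on its own should be enough.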

> +		"fail_at_unmount" on the other hand, works independently of the
> +		remainder options. It will only be tested during unmount time,
> +		but, it will shutdown the filesystem independent of the limits
> +		set into "max_retries" or "retry_timeout_seconds".
> +		It has been added because sysfs configuration can't be changed
> +		after an unmount is triggered, once the sysfs directory from
> +		the filesystem being unmounted will be detached from the sysfs
> +		tree, so, even if the sysadmin wants to make XFS retry forever
> +		for any error during the filesystem operation, the filesystem
> +		can still be properly unmounted if any error was detected and
> +		"fail_at_unmount" is set. Otherwise, the umount process get
> +		stuck forever.
> 

I'd leave this all out.
