Re: [PATCH] xfs: Introduce permanent async buffer write IO failures

Eric Sandeen <sandeen@xxxxxxxxxxx> · Wed, 18 Feb 2015 17:14:15 -0600

On 2/18/15 4:32 PM, Dave Chinner wrote:
>  	/*
> -	 * If the write of the buffer was synchronous, we want to make
> -	 * sure to return the error to the caller of xfs_bwrite().
> +	 * Repeated failure on an async write.
> +	 *
> +	 * Now we need to classify the error and determine the correct action to
> +	 * take. Different types of errors will require different processing,
> +	 * but make the default classification "transient" so that we keep
> +	 * retrying in these cases.  If this is the wrong thing to do, then we'll
> +	 * get reports that the filesystem hung in the presence of that type of
> +	 * error and we can take appropriate action to remedy the issue for that
> +	 * type of error.
>  	 */

So, I think this is the tricky part.

No errno has a universal meaning, and we don't know what kind of block device
we're talking to.  One device's ENOSPC may be another's EIO, and either may or
may not be permanent.  Maybe even ENODEV *isn't* permanent.  (...is it always
permanent?)

My first feeble hack at this was counting consecutive errors, and hard failing
after a set number.  Thinking about that later, it seems like something time-based
might be better than io-count-based.
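
To make the count-vs-time distinction concrete, here is a rough userspace
sketch.  None of it is XFS code: struct retry_state, struct retry_policy and
both helpers are hypothetical names made up for illustration, and timestamps
are plain seconds passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-buffer retry state; not an actual XFS structure. */
struct retry_state {
	unsigned int	fail_count;	/* consecutive failed writes */
	uint64_t	first_fail;	/* time of first failure in this run (s) */
};

/* Hypothetical limits; a value of 0 disables that kind of limit. */
struct retry_policy {
	unsigned int	max_retries;	/* io-count-based limit */
	uint64_t	max_seconds;	/* time-based limit */
};

/* A successful write ends the current run of failures. */
static void retry_success(struct retry_state *rs)
{
	rs->fail_count = 0;
	rs->first_fail = 0;
}

/*
 * Record a failed write at time 'now'.  Returns true when the caller
 * should stop retrying and treat the failure as permanent.
 */
static bool retry_failed(struct retry_state *rs,
			 const struct retry_policy *rp, uint64_t now)
{
	/* Remember when this run of consecutive failures started. */
	if (rs->fail_count++ == 0)
		rs->first_fail = now;

	if (rp->max_retries && rs->fail_count > rp->max_retries)
		return true;
	if (rp->max_seconds && now - rs->first_fail >= rp->max_seconds)
		return true;
	return false;
}
```

With only max_retries set, a burst of fast retries can exhaust the limit in
well under a second; with only max_seconds set, the device gets the same
wall-clock window to recover no matter how often we retry.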

Can we really simply switch on an error?  If nothing else, this might have
to be configurable somehow, so that an admin can choose which errors should
be treated as "permanent" on which device.
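
One possible shape for that, again as a userspace sketch and not a proposal
for the actual interface: a per-device table mapping errnos to a class, with
anything unlisted falling back to "transient" as in the patch.  The names
err_class, err_rule, dev_err_config and classify_error() are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>

enum err_class {
	ERR_TRANSIENT,		/* keep retrying */
	ERR_PERMANENT,		/* give up and shut down */
};

/* One admin-supplied rule: how this device should treat this errno. */
struct err_rule {
	int		error;
	enum err_class	cls;
};

/* Hypothetical per-device configuration. */
struct dev_err_config {
	const struct err_rule	*rules;
	size_t			nr_rules;
};

/* Look up 'error' for this device; unlisted errors stay transient. */
static enum err_class classify_error(const struct dev_err_config *cfg,
				     int error)
{
	for (size_t i = 0; i < cfg->nr_rules; i++) {
		if (cfg->rules[i].error == error)
			return cfg->rules[i].cls;
	}
	return ERR_TRANSIENT;
}
```

The same errno can then be permanent on one device and transient on another,
simply by giving the two devices different tables.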

(I think that's accurately summing up irc-and-side-channel discussions) ;)

Thanks,
-Eric

> +	switch (bp->b_error) {
> +	case ENODEV:
> +		/* permanent error, no recovery possible */
> +		xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> +		goto out_stale;
> +	default:
> +		/* do nothing, higher layers will retry */
> +		break;
> +	}
> +
> +	xfs_buf_relse(bp);
> +	return true;
