Re: [PATCH 5/7] xfs: add configuration of error failure speed

Carlos Maiolino <cmaiolino@xxxxxxxxxx> · Wed, 4 May 2016 11:57:29 +0200

On Wed, May 04, 2016 at 09:24:19AM +1000, Dave Chinner wrote:
> On Tue, May 03, 2016 at 07:55:38PM +0200, Carlos Maiolino wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > On reception of an error, we can fail immediately, perform some
> > bound amount of retries or retry indefinitely. The current behaviour
> > we have is to retry forever.
> > 
> > However, we'd like the ability to choose what behaviour we have, and
> > that requires the ability to configure the behaviour through the new
> > sysfs interfaces.
> > 
> > max_retries is used to set the number of retries to do until permanently fail,
> > the only special case is when it is set to -1, which, will cause the system to
> > retry indefinitely.
> 
> This doesn't read particularly well, and its more a description of
> the mechanism (which shoul dbe in the code) rather than the reason
> for implementing it.. The original commit message talked about
> different failure characteristics, which still need to be mentioned
> here. e.g:
> 
> 	We'd like to be able to configure errors to either fail
> 	immediately (fail fast), retry for some time before
> 	failing (fail slow) or retry forever (fail never). This is
> 	implemented by using specific values for the number of
> 	retries we allow.

Fair enough, I will re-work the patch descriptions in the next version.

> > 
> > Add both a maximum retry count and a retry timeout so that we can
> > bound by time and/or physical IO attempts.
> > 
> > Finally, plumb these into xfs_buf_iodone error processing so that
> > the error behaviour follows the selected configuration.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Signed-off-by:
> > Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> 
> You also need to document what changes you've made to the original
> patches. I can't tell what you've changed in each patch just by
> looking at them - this is the first one where I noticed something
> while reading it. i.e. the commit message should have:
> 
> [ cmaiolino: removed fail-fast/slow/never and instead use retry count
> to determine behaviour ]
>

Ok

> > ---
> >  fs/xfs/xfs_buf.h      | 23 ++++++++++++++-
> >  fs/xfs/xfs_buf_item.c | 13 +++++++--
> >  fs/xfs/xfs_mount.h    |  3 +-
> >  fs/xfs/xfs_sysfs.c    | 81 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  4 files changed, 112 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index adef116..bd978f0 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -183,7 +183,28 @@ typedef struct xfs_buf {
> >  	unsigned int		b_page_count;	/* size of page array */
> >  	unsigned int		b_offset;	/* page offset in first page */
> >  	int			b_error;	/* error code on I/O */
> > -	int			b_last_error;	/* previous async I/O error */
> > +
> > +	/*
> > +	 * async write failure retry count. Initialised to zero on the first
> > +	 * failure, then when it exceeds the maximum configured without a
> > +	 * success the write is considered to be failed permanently and the
> > +	 * iodone handler will take appropriate action.
> > +	 *
> > +	 * For retry timeouts, we record the jiffie of the first failure. This
> > +	 * means that we can change the retry timeout and it on the next error
> > +	 * it will be checked against the newly configured timeout. This
> > +	 * prevents buffers getting stuck in retry loops with a really long
> > +	 * timeout.
> > +	 *
> > +	 * last_error is used to ensure that we are getting repeated errors, not
> > +	 * different errors. e.g. a block device might change ENOSPC to EIO when
> > +	 * a failure timeout occurs, so we want to re-initialise the error
> > +	 * retry behaviour appropriately when that happens.
> > +	 */
> > +	int			b_retries;
> > +	unsigned long		b_first_retry_time; /* in jiffies */
> > +	int			b_last_error;
> > +
> >  	const struct xfs_buf_ops	*b_ops;
> >  
> >  #ifdef XFS_BUF_LOCK_TRACKING
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 3a30a88..68ed2a0 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -1086,6 +1086,9 @@ xfs_buf_iodone_callback_error(
> >  		bp->b_flags |= (XBF_WRITE | XBF_ASYNC |
> >  			        XBF_DONE | XBF_WRITE_FAIL);
> >  		bp->b_last_error = bp->b_error;
> > +		bp->b_retries = 0;
> > +		bp->b_first_retry_time = jiffies;
> > +
> >  		xfs_buf_ioerror(bp, 0);
> >  		xfs_buf_submit(bp);
> >  		return true;
> > @@ -1095,8 +1098,13 @@ xfs_buf_iodone_callback_error(
> >  	 * Repeated failure on an async write. Take action according to the
> >  	 * error configuration we have been set up to use.
> >  	 */
> > -	if (!cfg->max_retries)
> > -		goto permanent_error;
> > +	if (cfg->max_retries >= 0)
> > +		if (++bp->b_retries > cfg->max_retries)
> > +			goto permanent_error;
> > +	if (cfg->retry_timeout)
> > +		if (time_after(jiffies,
> > +			       cfg->retry_timeout + bp->b_first_retry_time))
> > +			goto permanent_error;
> 
> This is hard to read and the pattern is prone to future maintenance
> problems. i.e. nested single line if statements are prone to people
> making mistakes with indentation when adding more conditions to
> them.
> 
> So:
> 
> 	if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
> 	    ++bp->b_retries > cfg->max_retries)
> 		goto permanent_error;
> 	if (cfg->retry_timeout &&
> 	    time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
> 		goto permanent_error;
>

Fair enough, will change this in the next revision.

Thanks for the review

> >  	/* still a transient error, higher layers will retry */
> >  	xfs_buf_ioerror(bp, 0);
> > @@ -1139,6 +1147,7 @@ xfs_buf_iodone_callbacks(
> >  	 * retry state here in preparation for the next error that may occur.
> >  	 */
> >  	bp->b_last_error = 0;
> > +	bp->b_retries = 0;
> >  
> >  	xfs_buf_do_callbacks(bp);
> >  	bp->b_fspriv = NULL;
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 0c5a976..0382140 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -54,7 +54,8 @@ enum {
> >  
> >  struct xfs_error_cfg {
> >  	struct xfs_kobj	kobj;
> > -	int		max_retries;
> > +	int		max_retries;
> 
> #define XFS_ERR_RETRY_FOREVER	(-1)
> 
> > +	unsigned long	retry_timeout;	/* in jiffies, 0 = no timeout */
> >  };
> ....
> > +static ssize_t
> > +max_retries_store(
> > +	struct kobject	*kobject,
> > +	const char	*buf,
> > +	size_t		count)
> > +{
> > +	struct xfs_error_cfg *cfg = to_error_cfg(kobject);
> > +	int		ret;
> > +	int		val;
> > +
> > +	ret = kstrtoint(buf, 0, &val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (val < -1 || val > INT_MAX)
> > +		return -EINVAL;
> 
> Should special case XFS_ERR_RETRY_FOREVER here. i.e. range is
> 0-INT_MAX || XFS_ERR_RETRY_FOREVER.
> 
> 	if ((val < -1 || val > INT_MAX) &&
> 	    val != XFS_ERR_RETRY_FOREVER)
> 		return -EINVAL;
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

-- 
Carlos

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs