Re: [PATCH v4 00/17] xfs: flush related error handling cleanups

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 6 May 2020 07:04:00 -0400

On Wed, May 06, 2020 at 08:36:43AM +1000, Dave Chinner wrote:
> On Tue, May 05, 2020 at 07:58:10AM -0400, Brian Foster wrote:
> > On Tue, May 05, 2020 at 07:53:07AM +1000, Dave Chinner wrote:
> > > On Mon, May 04, 2020 at 10:11:37AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > I think everything has been reviewed to this point. Only minor changes
> > > > noted below in this release. A git repo is available here[1].
> > > > 
> > > > The only outstanding feedback that I'm aware of is Dave's comment on
> > > > patch 7 of v3 [2] regarding the shutdown assert check. I'm not aware of
> > > > any means to get through xfs_wait_buftarg() with a dirty buffer that
> > > > hasn't undergone the permanant error sequence and thus shut down the fs.
> > > 
> > > # echo 0 > /sys/fs/xfs/<dev>/fail_at_unmount
> > > 
> > > And now any error with a "retry forever" config (the default) will
> > > be collected by xfs_buftarg_wait() without a preceeding shutdown as
> > > xfs_buf_iodone_callback_error() will not treat it as a permanent
> > > error during unmount. i.e. this doesn't trigger:
> > > 
> > >         /* At unmount we may treat errors differently */
> > >         if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
> > >                 goto permanent_error;
> > > 
> > > and so the error handling just marks it with a write error and lets
> > > it go for a write retry in future. These are then collected in
> > > xfs_buftarg_wait() as nothing is going to retry them once unmount
> > > gets to this point...
> > > 
> > 
> > That doesn't accurately describe the behavior of that configuration,
> > though. "Retry forever" means that dirty buffers are going to cycle
> > through submission retries and the unmount is going to hang indefinitely
> > (on pushing the AIL). Indeed, preventing this unmount hang is the
> > original purpose of the fail at unmount knob (commit here[1]).
> 
> So this patch is relying on something else hanging before we get to
> xfs_wait_buftarg and hence "we don't have to handle that here".
> 

Yes, though it's the existing code that relies on the external push to
block until everything is written back or a permanent error is issued.
This patch just splits up and rate limits a warning message. I added the
assert because of the previous discussion around making sure this
situation is handled properly and because that error processing
dependency is not obvious.

> Please document that assumption along with the ASSERT, because if we
> change AIL behaviour to prevent retry-forever hangs with freeze
> and/or remount-ro we're going to hit this assert...
> 

Sure, I'll add a comment..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>