On Mon, Feb 17, 2014 at 04:16:42PM +0100, Jan Kara wrote:
> Hi Dave,
>
> On Mon 17-02-14 15:40:47, Dave Chinner wrote:
> > I just had a tester report to me that he'd bisected an XFS assert
> > failure on unmount to this commit:
> >
> > # git bisect bad
> > c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
> > commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
> > Author: Jan Kara <jack@xxxxxxx>
> > Date: Tue Nov 12 15:07:51 2013 -0800
> >
> > writeback: do not sync data dirtied after sync start
> >
> > When there are processes heavily creating small files while sync(2) is
> > running, it can easily happen that quite some new files are created
> > between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> > especially if there are several busy filesystems (remember that sync
> > traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> > fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> > (e.g. causes a transaction commit and cache flush for each inode in
> > ext3), resulting sync(2) times are rather large
> >
> > The XFS assert failure was that on unmount there were delayed
> > allocation blocks accounted to an inode when it was being evicted.
> > i.e. during the evict_inodes() call in generic_shutdown_super() after
> > the filesystem had been synced. There should be no dirty data at
> > this point in time....
> >
> > The writeback code is a maze of twisty passages, so I'm not 100%
> > sure of my reasoning - the call chain analysis I did is below so you
> > can confirm I didn't miss anything. Assuming I got it right,
> > however....
> >
> > What it looks like is that there's a problem with requeue_inode()
> > behaviour and redirty_page_for_writepage(). If write_cache_pages()
> > runs out of its writeback chunk, the inode is still dirty when it
> > returns to writeback_sb_inodes(). This leads to requeue_inode()
> > updating the dirtied_when field in the inode for WB_SYNC_NONE +
> > wbc->tagged_writepages, then calling redirty_tail() because
> > wbc->pages_skipped is set. That puts the inode back on b_dirty with
> > a timestamp after the sync started, and so the WB_SYNC_ALL pass
> > skips it, hence it doesn't get synced to disk even though it should.
> > IOWs, sync_filesystem can silently fail to write dirty data to disk.
> Yes, this analysis is basically correct. The only small imprecision is
> that we cannot run out of writeback chunk when doing tagged_writepages
> writeback - writeback_chunk_size() sets nr_to_write to LONG_MAX for such
> writeback regardless of what was passed in the work item. But still the
> inode will be dirty because of redirty_page_for_writepage() after we return
> to writeback_sb_inodes() so this imprecision doesn't make a difference.

Ah, right, I missed the writeback_chunk_size() behaviour. I knew
there'd be something I missed. ;)
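
For anyone else following along, the logic in question looks roughly
like this - I'm paraphrasing fs/fs-writeback.c from memory here, so
treat the details of the non-sync branch as approximate:

	static long writeback_chunk_size(struct backing_dev_info *bdi,
					 struct wb_writeback_work *work)
	{
		long pages;

		/*
		 * Sync-style writeback (WB_SYNC_ALL or tagged_writepages)
		 * gets an effectively unbounded chunk, so an inode can't
		 * be left half-written because the chunk ran out.
		 */
		if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
			pages = LONG_MAX;
		else {
			/* background writeback is throttled to roughly a
			 * write-bandwidth sized chunk */
			pages = min(bdi->avg_write_bandwidth / 2,
				    global_dirty_limit / DIRTY_SCOPE);
			pages = min(pages, work->nr_pages);
			pages = round_down(pages + MIN_WRITEBACK_PAGES,
					   MIN_WRITEBACK_PAGES);
		}

		return pages;
	}

So, as you say, tagged_writepages writeback can never run out of its
chunk; it's the redirty_page_for_writepage() call that leaves the inode
dirty by the time we get back to writeback_sb_inodes().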

> > In this case, XFS is skipping pages because it can't get the inode
> > metadata lock without blocking on it, and we are in non-blocking
> > mode because we are doing WB_SYNC_NONE writeback. We could also
> > check for wbc->tagged_writepages, but nobody else does, nor does it
> > fix the problem of calling redirty_page_for_writepage() in
> > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > control specific constraints on their calls to
> > redirty_page_for_writepage() and so it seems to me like it's a
> > generic issue, not just an XFS issue.
> >
> > Digging deeper, it looks to me like our sync code simply does not
> > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > properly.
> Well, there are two different things:
> a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> we are in trouble because definition of that writeback is that it must
> write everything. So I would consider that a fs bug (we could put a WARN_ON
> for this into redirty_page_for_writepage()). Arguably, we could be nice to
> filesystems and instead of warning just retry writeback indefinitely but
> unless someone comes with a convincing usecase I'd rather avoid that.

Right, that might be true, but almost all .writepages implementations
unconditionally call redirty_page_for_writepage() in certain
circumstances. e.g. xfs/ext4/btrfs do it when called from direct
reclaim context to avoid the possibility of stack overruns. ext4 does
it unconditionally when a memory allocation fails, etc.

So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in all
conditions, and quite frankly I'd prefer that we fail a WB_SYNC_ALL
writeback than risk a stack overrun. Currently we are stuck between a
rock and a hard place with that.

> b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> a different matter. There it is clearly allowed and writeback code must
> handle that gracefully.

*nod*

> > It looks to me like requeue_inode should never rewrite
> > the timestamp of the inode if we skipped pages, because that means
> > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > wbc->tagged_writepages writeback. Further, if there are skipped
> > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > to do another pass on the inode to ensure we writeback the skipped
> > pages in this writeback pass regardless of the WB_SYNC flags or
> > wbc->tagged_writepages field.
> Resetting timestamp in requeue_inode() is one thing which causes problems
> but even worse seems the redirty_tail() call which also updates the
> i_dirtied_when timestamp. So any call to redirty_tail() will effectively
> exclude the inode from running sync(2) writeback and that's wrong.

*nod*

I missed that aspect of the redirty_tail() behaviour, too. Forest,
trees. This aspect of the problem may be more important than the
problem with skipped pages....

> Either
> I'll quickly find a way for redirty_tail() to not clobber i_dirtied_when or
> I'll ask Linus to revert the above commit so that we have more time for
> thinking...

No worries. Thanks for looking into the problem, Jan.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
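
PS: for completeness, the redirty_tail() behaviour being discussed is
roughly this - again paraphrased from memory, so details approximate:

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail;

			tail = wb_inode(wb->b_dirty.next);
			/*
			 * Keep b_dirty sorted by dirtied_when: if this inode
			 * is older than the current list tail, bump its
			 * timestamp to "now".
			 */
			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}

With the commit above, the sync(2) passes only queue up inodes dirtied
before the sync started, so any inode that goes through here while sync
is running can have its timestamp bumped past that cutoff and be
skipped by the WB_SYNC_ALL pass.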