Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?

Jan Kara <jack@xxxxxxx> · Mon, 17 Feb 2014 16:16:42 +0100

  Hi Dave,

On Mon 17-02-14 15:40:47, Dave Chinner wrote:
> I just had a tester report to me that he'd bisected an XFS assert
> failure on unmount to this commit:
> 
> # git bisect bad
> c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
> commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
> Author: Jan Kara <jack@xxxxxxx>
> Date:   Tue Nov 12 15:07:51 2013 -0800
> 
>     writeback: do not sync data dirtied after sync start
> 
>     When there are processes heavily creating small files while sync(2) is
>     running, it can easily happen that quite some new files are created
>     between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
>     especially if there are several busy filesystems (remember that sync
>     traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
>     fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
>     (e.g.  causes a transaction commit and cache flush for each inode in
>     ext3), resulting sync(2) times are rather large
> 
> The XFS assert failure was that on unmount there were delayed
> allocation blocks accounted to an inode when it was being evicted.
> i.e. during the evict_inodes() call in generic_shutdown_super() after
> the filesystem had been synced. There should be no dirty data at
> this point in time....
> 
> The writeback code is a maze of twisty passages, so I'm not 100%
> sure of my reasoning - the call chain analysis I did is below so you
> can confirm I didn't miss anything. Assuming I got it right,
> however....
> 
> What it looks like is that there's a problem with requeue_inode()
> behaviour and redirty_page_for_writepage(). If write_cache_pages()
> runs out of it's writeback chunk, the inode is still dirty when it
> returns to writeback_sb_inodes(). This leads to requeue_inode()
> updating the dirtied_when field in the inode for WB_SYNC_NONE +
> wbc->tagged_writepages, then calling redirty_tail() because
> wbc->pages_skipped is set. That puts the inode back on b_dirty with
> a timestamp after the sync started, and so the WB_SYNC_ALL pass
> skips it, hence it doesn't get synced to disk even though it should.
> IOWs, sync_filesystem can silently fail to write dirty data to disk.
  Yes, this analysis is basically correct. The only small imprecision is
that we cannot run out of writeback chunk when doing tagged_writepages
writeback - writeback_chunk_size() sets nr_to_write to LONG_MAX for such
writeback regardless of what was passed in the work item. But still the
inode will be dirty because of redirty_page_for_writepage() after we return
to writeback_sb_inodes() so this imprecision doesn't make a difference.

> In this case, XFS is skipping pages because it can't get the inode
> metadata lock without blocking on it, and we are in non-blocking
> mode because we are doing WB_SYNC_NONE writeback. We could also
> check for wbc->tagged_writepages, but nobody else does, nor does it
> fix the problem of calling redirty_page_for_writepage() in
> WB_SYNC_ALL conditions. None of the filesystems put writeback
> control specific contraints on their calls to
> redirty_page_for_writepage() and so it seems to me like it's a
> generic issue, not just an XFS issue.
> 
> Digging deeper, it looks to me like our sync code simply does not
> handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> properly.
  Well, there are two different things:
a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
we are in trouble because definition of that writeback is that it must
write everything. So I would consider that a fs bug (we could put a WARN_ON
for this into redirty_page_for_writepage()). Arguably, we could be nice to
filesystems and instead of warning just retry writeback indefinitely but
unless someone comes with a convincing usecase I'd rather avoid that.

b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
a different matter. There it is clearly allowed and writeback code must
handle that gracefully.

> It looks to me like requeue_inode should never rewrite
> the timestamp of the inode if we skipped pages, because that means
> we didn't write everything we were supposed to for WB_SYNC_ALL or
> wbc->tagged_writepages writeback. Further, if there are skipped
> pages we should be pushing the inode to b_more_io, not b_dirty so as
> to do another pass on the inode to ensure we writeback the skipped
> pages in this writeback pass regardless of the WB_SYNC flags or
> wbc->tagged_writepages field.
  Resetting timestamp in requeue_inode() is one thing which causes problems
but even worse seems the redirty_tail() call which also updates the
i_dirtied_when timestamp. So any call to redirty_tail() will effectively
exclude the inode from running sync(2) writeback and that's wrong. Either
I'll quickly find a way for redirty_tail() to not clobber i_dirtied_when or
I'll ask Linus to revert the above commit so that we have more time for
thinking...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html