On Mon, Feb 17, 2014 at 04:16:42PM +0100, Jan Kara wrote:
> Hi Dave,
>
> On Mon 17-02-14 15:40:47, Dave Chinner wrote:
> > I just had a tester report to me that he'd bisected an XFS assert
> > failure on unmount to this commit:
> >
> > # git bisect bad
> > c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
> > commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
> > Author: Jan Kara <jack@xxxxxxx>
> > Date: Tue Nov 12 15:07:51 2013 -0800
> >
> > writeback: do not sync data dirtied after sync start
> >
> > When there are processes heavily creating small files while sync(2) is
> > running, it can easily happen that quite some new files are created
> > between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> > especially if there are several busy filesystems (remember that sync
> > traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> > fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> > (e.g. causes a transaction commit and cache flush for each inode in
> > ext3), resulting sync(2) times are rather large
> >
> > The XFS assert failure was that on unmount there were delayed
> > allocation blocks accounted to an inode when it was being evicted.
> > i.e. during the evict_inodes() call in generic_shutdown_super() after
> > the filesystem had been synced. There should be no dirty data at
> > this point in time....
> >
> > The writeback code is a maze of twisty passages, so I'm not 100%
> > sure of my reasoning - the call chain analysis I did is below so you
> > can confirm I didn't miss anything. Assuming I got it right,
> > however....
> >
> > What it looks like is that there's a problem with requeue_inode()
> > behaviour and redirty_page_for_writepage(). If write_cache_pages()
> > runs out of its writeback chunk, the inode is still dirty when it
> > returns to writeback_sb_inodes(). This leads to requeue_inode()
> > updating the dirtied_when field in the inode for WB_SYNC_NONE +
> > wbc->tagged_writepages, then calling redirty_tail() because
> > wbc->pages_skipped is set. That puts the inode back on b_dirty with
> > a timestamp after the sync started, and so the WB_SYNC_ALL pass
> > skips it, hence it doesn't get synced to disk even though it should.
> > IOWs, sync_filesystem can silently fail to write dirty data to disk.
> Yes, this analysis is basically correct. The only small imprecision is
> that we cannot run out of writeback chunk when doing tagged_writepages
> writeback - writeback_chunk_size() sets nr_to_write to LONG_MAX for such
> writeback regardless of what was passed in the work item. But still the
> inode will be dirty because of redirty_page_for_writepage() after we return
> to writeback_sb_inodes() so this imprecision doesn't make a difference.

Ah, right, I missed the writeback_chunk_size() behaviour. I knew
there'd be something I missed. ;)
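
For anyone else following along, the logic in question looks roughly
like this - I'm paraphrasing fs/fs-writeback.c from memory here, so
treat the details of the non-sync branch as approximate:

	static long writeback_chunk_size(struct backing_dev_info *bdi,
					 struct wb_writeback_work *work)
	{
		long pages;

		/*
		 * Sync-style writeback (WB_SYNC_ALL or tagged_writepages)
		 * gets an effectively unbounded chunk, so an inode can't
		 * be left half-written because the chunk ran out.
		 */
		if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
			pages = LONG_MAX;
		else {
			/* background writeback is throttled to roughly a
			 * write-bandwidth sized chunk */
			pages = min(bdi->avg_write_bandwidth / 2,
				    global_dirty_limit / DIRTY_SCOPE);
			pages = min(pages, work->nr_pages);
			pages = round_down(pages + MIN_WRITEBACK_PAGES,
					   MIN_WRITEBACK_PAGES);
		}

		return pages;
	}

So, as you say, tagged_writepages writeback can never run out of its
chunk; it's the redirty_page_for_writepage() call that leaves the inode
dirty by the time we get back to writeback_sb_inodes().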

> > In this case, XFS is skipping pages because it can't get the inode
> > metadata lock without blocking on it, and we are in non-blocking
> > mode because we are doing WB_SYNC_NONE writeback. We could also
> > check for wbc->tagged_writepages, but nobody else does, nor does it
> > fix the problem of calling redirty_page_for_writepage() in
> > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > control specific constraints on their calls to
> > redirty_page_for_writepage() and so it seems to me like it's a
> > generic issue, not just an XFS issue.
> >
> > Digging deeper, it looks to me like our sync code simply does not
> > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > properly.
> Well, there are two different things:
> a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> we are in trouble because definition of that writeback is that it must
> write everything. So I would consider that a fs bug (we could put a WARN_ON
> for this into redirty_page_for_writepage()). Arguably, we could be nice to
> filesystems and instead of warning just retry writeback indefinitely but
> unless someone comes with a convincing usecase I'd rather avoid that.

Right, that might be true, but almost all .writepages implementations
unconditionally call redirty_page_for_writepage() in certain
circumstances. e.g. xfs/ext4/btrfs do it when called from direct
reclaim context to avoid the possibility of stack overruns. ext4 does
it unconditionally when a memory allocation fails, etc.

So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in all
conditions, and quite frankly I'd prefer that we fail a WB_SYNC_ALL
writeback than risk a stack overrun. Currently we are stuck between a
rock and a hard place with that.

> b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> a different matter. There it is clearly allowed and writeback code must
> handle that gracefully.

*nod*

> > It looks to me like requeue_inode should never rewrite
> > the timestamp of the inode if we skipped pages, because that means
> > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > wbc->tagged_writepages writeback. Further, if there are skipped
> > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > to do another pass on the inode to ensure we writeback the skipped
> > pages in this writeback pass regardless of the WB_SYNC flags or
> > wbc->tagged_writepages field.
> Resetting timestamp in requeue_inode() is one thing which causes problems
> but even worse seems the redirty_tail() call which also updates the
> i_dirtied_when timestamp. So any call to redirty_tail() will effectively
> exclude the inode from running sync(2) writeback and that's wrong.

*nod*

I missed that aspect of the redirty_tail() behaviour, too. Forest,
trees. This aspect of the problem may be more important than the
problem with skipped pages....

> Either
> I'll quickly find a way for redirty_tail() to not clobber i_dirtied_when or
> I'll ask Linus to revert the above commit so that we have more time for
> thinking...

No worries. Thanks for looking into the problem, Jan.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
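
PS: for completeness, the redirty_tail() behaviour being discussed is
roughly this - again paraphrased from memory, so details approximate:

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail;

			tail = wb_inode(wb->b_dirty.next);
			/*
			 * Keep b_dirty sorted by dirtied_when: if this inode
			 * is older than the current list tail, bump its
			 * timestamp to "now".
			 */
			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}

With the commit above, the sync(2) passes only queue up inodes dirtied
before the sync started, so any inode that goes through here while sync
is running can have its timestamp bumped past that cutoff and be
skipped by the WB_SYNC_ALL pass.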