On Wed 19-02-14 00:29:24, Dave Chinner wrote: > On Tue, Feb 18, 2014 at 10:38:20AM +0100, Jan Kara wrote: > > On Tue 18-02-14 11:23:12, Dave Chinner wrote: > > > > > In this case, XFS is skipping pages because it can't get the inode > > > > > metadata lock without blocking on it, and we are in non-blocking > > > > > mode because we are doing WB_SYNC_NONE writeback. We could also > > > > > check for wbc->tagged_writepages, but nobody else does, nor does it > > > > > fix the problem of calling redirty_page_for_writepage() in > > > > > WB_SYNC_ALL conditions. None of the filesystems put writeback > > > > > control specific contraints on their calls to > > > > > redirty_page_for_writepage() and so it seems to me like it's a > > > > > generic issue, not just an XFS issue. > > > > > > > > > > Digging deeper, it looks to me like our sync code simply does not > > > > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL > > > > > properly. > > > > Well, there are two different things: > > > > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback, > > > > we are in trouble because definition of that writeback is that it must > > > > write everything. So I would consider that a fs bug (we could put a WARN_ON > > > > for this into redirty_page_for_writepage()). Arguably, we could be nice to > > > > filesystems and instead of warning just retry writeback indefinitely but > > > > unless someone comes with a convincing usecase I'd rather avoid that. > > > > > > Right, that might be true, but almost all .writepages > > > implementations unconditionally call redirty_page_for_writepage() in > > > certain circumstances. e.g. xfs/ext4/btrfs do it when called from > > > direct reclaim context to avoid the possibility of stack overruns. > > > ext4 does it unconditionally when a memory allocation fails, etc. > > > > > > So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in > > > all conditions, and quite frankly I'd prefer that we fail a > > > WB_SYNC_ALL writeback than risk a stack overrun. Currently we are > > > stuck between a rock and a hard place with that. > > OK, I agree that returning error from sync / fsync in some rare corner > > cases is better than crashing the kernel. Reclaim shouldn't be an issue as > > that does only WB_SYNC_NONE writeback but out of memory conditions are real > > for WB_SYNC_ALL writeback. > > > > Just technically that means we have to return some error code from > > ->writepage() / ->writepages() for WB_SYNC_ALL writeback while we have to > > silently continue for WB_SYNC_NONE writeback. That will require some > > tweaking within filesystems. > > > > > > b) Calling redirty_page_for_writepage() for tagged_writepages writeback is > > > > a different matter. There it is clearly allowed and writeback code must > > > > handle that gracefully. > > > > > > *nod* > > > > > > > > It looks to me like requeue_inode should never rewrite > > > > > the timestamp of the inode if we skipped pages, because that means > > > > > we didn't write everything we were supposed to for WB_SYNC_ALL or > > > > > wbc->tagged_writepages writeback. Further, if there are skipped > > > > > pages we should be pushing the inode to b_more_io, not b_dirty so as > > > > > to do another pass on the inode to ensure we writeback the skipped > > > > > pages in this writeback pass regardless of the WB_SYNC flags or > > > > > wbc->tagged_writepages field. > > > > Resetting timestamp in requeue_inode() is one thing which causes problems > > > > but even worse seems the redirty_tail() call which also updates the > > > > i_dirtied_when timestamp. So any call to redirty_tail() will effectively > > > > exclude the inode from running sync(2) writeback and that's wrong. > > > > > > *nod* > > > > > > I missed that aspect of the redirty_tail() behaviour, too. Forest, > > > trees. This aspect of the problem may be more important than the > > > problem with skipped pages.... > > redirty_tail() behavior is a pain for a long time. But we cannot just rip > > it out because we need a way to tell "try to writeback this inode later" > > where later should be "significantly later" - usually writeback on that > > inode is blocked by some other IO, lock, or something similar. So without > > redirty_tail() we just spinned in writeback code for significant time > > busywaiting for IO or a lock. I actually have patches to remove > > redirty_tail() from like two years ago but btrfs was particularly inventive > > in screwing up writeback back then so we didn't merge the patches in the end. > > Maybe it's time to revisit this. > > OK, I suspect that there are oter problem lurking here, too. I just > hit a problem on generic/068 on a ramdisk on XFS where a sync call > would never complete until the writer processes were killed. fstress > got stuck here: > > [222229.551097] fsstress D ffff88021bc13180 4040 5898 5896 0x00000000 > [222229.551097] ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180 > [222229.551097] ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850 > [222229.551097] ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff > [222229.551097] Call Trace: > [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20 > [222229.551097] [<ffffffff81ce35e9>] schedule+0x29/0x70 > [222229.551097] [<ffffffff81ce28c1>] schedule_timeout+0x171/0x1d0 > [222229.551097] [<ffffffff810b0eda>] ? __queue_delayed_work+0x9a/0x170 > [222229.551097] [<ffffffff810b0b41>] ? try_to_grab_pending+0xc1/0x180 > [222229.551097] [<ffffffff81ce434f>] wait_for_completion+0x9f/0x110 > [222229.551097] [<ffffffff810c7810>] ? try_to_wake_up+0x2c0/0x2c0 > [222229.551097] [<ffffffff811d3c4a>] sync_inodes_sb+0xca/0x1f0 > [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20 > [222229.551097] [<ffffffff811db94c>] sync_inodes_one_sb+0x1c/0x20 > [222229.551097] [<ffffffff811af219>] iterate_supers+0xe9/0xf0 > [222229.551097] [<ffffffff811dbb32>] sys_sync+0x42/0xa0 > [222229.551097] [<ffffffff81cf0d29>] system_call_fastpath+0x16/0x1b > > This then held off the filesystem freeze due to holding s_umount, > and the two fstest processes just kept running dirtying the > filesystem. It wasn't until I kill the fstests processes by removing > the tmp file that the sync completed and the test made progress. OK, so flusher thread (or actually the corresponding kworker) was continuously writing the newly dirtied data? So far I didn't reproduce this but I'll try... > It's reproducable, and I left it for a couple of hours to see if > would resolve itself. It didn't, so I had to kick it to break the > livelock. I wonder whether it might be some incarnation of a bug fixed here: https://lkml.org/lkml/2014/2/14/733 The effects should be somewhat different but it's in that area. Can you try with that patch? Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html