On Tue, Jun 26, 2018 at 08:27:23AM +1000, Dave Chinner wrote:
> On Fri, Jun 22, 2018 at 10:41:10AM -0400, Brian Foster wrote:
> > On Fri, Jun 22, 2018 at 01:09:41PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > We've recently seen a workload on XFS filesystems with a repeatable
> > > deadlock between background writeback and a multi-process
> > > application doing concurrent writes and fsyncs to a small range of a
> > > file.
>
> [...]
>
> > > #2 is simple, and I don't think it will have any impact on
> > > performance as going back to the start of the file implies an
> > > immediate seek. We'll have exactly the same number of seeks if we
> > > switch writeback to another inode, and then come back to this one
> > > later and restart from index 0.
> > >
> > > #2a is pretty much "status quo without the deadlock". Moving the
> > > retry loop up into the wcp caller means we can issue IO on the
> > > pending pages before calling wcp again, and so avoid locking or
> > > waiting on pages in the wrong order. I'm not convinced we need to do
> > > this given that we get the same thing from #2 on the next writeback
> > > call from the writeback infrastructure.
> > >
> > > #3 is really just a band-aid - it doesn't fix the access/wait inversion
> > > problem, just prevents it from becoming a deadlock situation. I'd
> > > prefer we fix the inversion, not sweep it under the carpet like
> > > this.
> > >
> > > #3a is really an optimisation that just so happens to include the
> > > band-aid fix of #3.
> > >
> > > So I'm really tending towards #2 as the simplest way to fix this,
> > > and that's what the patch below implements.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> >
> > FWIW, this seems like a reasonable approach to me. One thing I'm not
> > sure about is whether the higher level plug in wb_writeback() could
> > cause the same problem even with the lower level cycle restart out of
> > the picture.
>
> Plugging can't cause this because it captures bios that have
> been released from the caller context via submit_bio(). The plug list
> has hooks in the scheduler that flush it on context switch precisely
> so that we don't get deadlock problems with bios being stuck on the
> plug list when we block for some reason.
>

Ah, right..

> > It looks to me that if the inode still has dirty pages after the
> > write_cache_pages(), it ends up on wb->b_more_io via
> > writeback_sb_inodes() -> requeue_inode(). The caller (wb_writeback())
> > repopulates ->b_io from ->b_more_io (via queue_io()) if the former is
> > empty (and we haven't otherwise stopped) before finishing the plug.
> > There is a blk_flush_plug() in writeback_sb_inodes(), but that appears
> > to be non-deterministic. That call aside, could that plug effectively
> > hold the page in writeback long enough for background writeback to spin
> > around and sit on the page 1 lock?
>
> Right, that could happen, but the plug list will be flushed before
> we context switch away and sleep after failing to get the page lock,
> so there's no problem here.
>

Got it, thanks!

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
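
For readers less familiar with the writeback code, here is a toy,
userspace-only model of the behaviour option #2 describes (all names
below are invented for illustration; this is not the actual
write_cache_pages() change): a writeback pass that starts mid-file
simply stops at EOF and records that the next pass should begin at
index 0, rather than wrapping back to the start of the file within
the same pass.

/*
 * Toy model of option #2: no wrap-around within a single writeback
 * pass.  All names are invented; this only illustrates the control
 * flow, not the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 8

struct toy_mapping {
	bool dirty[NR_PAGES];
	unsigned long writeback_index;	/* where the next pass starts */
};

/* One pass: write back dirty pages from writeback_index up to EOF. */
static int toy_writeback_pass(struct toy_mapping *m)
{
	unsigned long i;
	int written = 0;

	for (i = m->writeback_index; i < NR_PAGES; i++) {
		if (m->dirty[i]) {
			m->dirty[i] = false;	/* "issue the IO" */
			written++;
		}
	}
	/*
	 * Hit EOF: instead of cycling back to index 0 inside this pass,
	 * note that the next pass should start from the beginning and
	 * return, so the caller can submit what was gathered and come
	 * back to this mapping later.
	 */
	m->writeback_index = 0;
	return written;
}

int main(void)
{
	struct toy_mapping m = { .writeback_index = 3 };

	m.dirty[1] = m.dirty[5] = true;

	/* First pass starts mid-file and only reaches pages >= 3. */
	printf("pass 1 wrote %d page(s)\n", toy_writeback_pass(&m));
	/* Second pass restarts from index 0 and picks up page 1. */
	printf("pass 2 wrote %d page(s)\n", toy_writeback_pass(&m));
	return 0;
}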