On Thu, Sep 3, 2020 at 3:28 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Sep 02, 2020 at 08:42:25AM -0400, Josef Bacik wrote:
> > On 9/2/20 8:20 AM, Dave Chinner wrote:
> > > On Wed, Sep 02, 2020 at 12:44:14PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote:
> > > > > Put simply: converting a filesystem to use iomap is not a "change the filesystem interfacing code and it will work" modification. We ask that filesystems are modified to conform to the iomap IO exclusion model; adding special cases for every potential locking and mapping quirk every different filesystem has is part of what turned the old direct IO code into an unmaintainable nightmare.
> > > > >
> > > > > > That's fine, but this is kind of a bad way to find out. We really shouldn't have generic helpers that have different generic locking rules based on which file system uses them.
> > > > >
> > > > > We certainly can change the rules for new infrastructure. Indeed, we had to change the rules to support DAX. The whole point of the iomap infrastructure was that it enabled us to use code that already worked for DAX (the XFS code) in multiple filesystems. And as people have realised that DIO via iomap is much faster than the old DIO code and is a much more efficient way of doing large buffered IO, other filesystems have started to use it.
> > > > >
> > > > > However....
> > > > >
> > > > > > Because then we end up with situations like this, where suddenly we're having to come up with some weird solution because the generic thing only works for a subset of file systems. Thanks,
> > > > >
> > > > > .... we've always said "you need to change the filesystem code to use iomap". This is simply a reflection of the fact that iomap has different rules and constraints to the old code, and so it's not a direct plug-in replacement. There are no short cuts here...
> > > >
> > > > Can you point me (and I suspect Josef!) towards the documentation of the locking model? I was hoping to find Documentation/filesystems/iomap.rst but all the 'iomap' strings in Documentation/ refer to pci_iomap and similar, except for this in the DAX documentation:
> > >
> > > There's no locking model documentation because there is no locking in the iomap direct IO code. The filesystem defines and does all the locking, so there's pretty much nothing to document for iomap.
> > >
> > > IOWs, the only thing iomap_dio_rw requires is that the IO completion paths do not take the same locks that the IO submission path requires. And that's because:
> > >
> > > /*
> > >  * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
> > >  * is being issued as AIO or not. [...]
> > >
> > > So you obviously can't sit waiting for dio completion in iomap_dio_rw() while holding the submission lock if completion requires the submission lock to make progress.
> > >
> > > FWIW, iomap_dio_rw() originally required the inode_lock() to be held and contained a lockdep assert to verify this, but....
> > >
> > > commit 3ad99bec6e82e32fa9faf2f84e74b134586b46f7
> > > Author: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> > > Date:   Sat Nov 30 09:59:25 2019 -0600
> > >
> > >     iomap: remove lockdep_assert_held()
> > >
> > >     Filesystems such as btrfs can perform direct I/O without holding the
> > >     inode->i_rwsem in some of the cases like writing within i_size. So,
> > >     remove the check for lockdep_assert_held() in iomap_dio_rw().
> > >
> > >     Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > >     Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> > >     Signed-off-by: David Sterba <dsterba@xxxxxxxx>
> > >
> > > ... btrfs has special corner cases for direct IO locking and hence we removed the lockdep assert....
> > >
> > > IOWs, iomap_dio_rw() really does not care what strategy filesystems use to serialise DIO against other operations. Filesystems can use whatever IO serialisation mechanism they want (mutex, rwsem, range locks, etc.) as long as they obey the one simple requirement: do not take the DIO submission lock in the DIO completion path.
> >
> > Goldwyn has been working on these patches for a long time, and is actually familiar with this code, and he missed that these two interfaces are being mixed. This is a problem that I want to solve. He didn't notice it in any of his testing, which IIRC was like 6 months to get this stuff actually into the btrfs tree.
>
> And that "been working on it for 6 months" is what really worries me the most. Hence I've done some post-mortem analysis of the "inode_lock" deadlock situation. I decided to do this after I realised this same patch set was merged last cycle and reverted late in the cycle because of problems it was found to have with page invalidation when mixed IO types were used.
>
> I did a bit of looking around yesterday afternoon, both in the btrfs code to understand the iomap_dio_rw changes and at our test coverage for DIO functionality.
>
> The more I dug, the more I'm finding that the most important question we need to answer is not "why wasn't this iomap locking requirement documented"; the question we should be trying to answer is this:
>
> Why wasn't this *obvious* deadlock detected months ago?
>
> First of all, the inode_lock() complaints are largely a red herring, as the typical btrfs DIO write path through the patchset does not take the inode lock and hence it is actually "safe" to take the inode lock in the completion path in the common case. i.e. for DIO writes wholly inside EOF, submission drops the inode_lock() before calling iomap_dio_rw() and so there is no inode_lock() deadlock present. From that viewpoint, it looks like there's only a specific corner case where this deadlock might occur, and that would explain why it hasn't been discovered until now.
>
> However.
>
> The new btrfs dio write path always does down_read(&BTRFS_I(inode)->dio_sem), even when the inode lock has not been taken. That's important because btrfs_sync_file() always does a down_write(&BTRFS_I(inode)->dio_sem) call and will deadlock in IO completion if the IO submission code is holding the dio_sem.

Taking a look at the latest integration branch (misc-next) after 2 weeks away, the new iomap dio switch indeed still has that problem. I reported it in June:

https://lore.kernel.org/linux-btrfs/CAL3q7H4F9iQJy3tgwZrWOKwenAnnn7oSthQZUMEJ_vWx3WE3WQ@xxxxxxxxxxxxxx/

For me, the deadlock happens every time with generic/113.
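To make the lock cycle concrete, here's a rough sketch of the ordering Dave describes above (illustrative only - the call chain is abbreviated and the surrounding btrfs submission code is elided, so don't read it as the literal code):

    /* DIO write submission path (sketch): */
    down_read(&BTRFS_I(inode)->dio_sem);     /* submission takes dio_sem shared */

    iomap_dio_rw(iocb, iter, ...);
    /*
     * For a non-AIO O_DSYNC write, iomap_dio_rw() runs completion before
     * returning, and completion ends up calling:
     *
     *   generic_write_sync()
     *     btrfs_sync_file()
     *       down_write(&BTRFS_I(inode)->dio_sem);
     *
     * down_write() waits for the read holder - which is this same task -
     * so it never returns.
     */

    up_read(&BTRFS_I(inode)->dio_sem);       /* never reached */

That's the cycle, independent of whether the inode_lock() is also involved.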
>
> So regardless of the inode_lock(), non-AIO O_DSYNC DIO writes through iomap_dio_rw() should deadlock immediately on the first IO with this locking arrangement. It will deadlock on either the inode_lock or the dio_sem, depending on whether the O_DSYNC DIO write is wholly within EOF or not, but the deadlock is guaranteed to occur. Hence we can completely ignore the "inode_lock vs fsync" side show, because other btrfs internal locks will trigger the same issue.
>
> If this is correct, then how did this "should happen instantly" bug go undiscovered for months?
>
> Well.... It appears that fstests has no coverage of non-AIO O_DSYNC DIO writes. Tools like fsx and fsstress don't have O_SYNC, O_DSYNC or RWF_DSYNC modes, and O_SYNC and O_DSYNC are not used by any of the common DIO test programs. xfs_io can open files O_SYNC and there's a test that uses xfs_io's RWF_DSYNC capability, but they all use buffered IO, and so none of the tests that open or write data synchronously use direct IO.
>
> The only conclusion I can make from this is that the case that should deadlock instantly isn't actually covered by fstests at all. This conclusion only stands up if this O_DSYNC code path was only "tested" for feature coverage with fstests. However, it does imply that no pre-implementation audit was done to determine if fstests actually covered all the functionality that needed to be tested here....
>
> I tend to use xfs_io and fio for DIO feature correctness testing long before I run fstests on new code. That's how I developed and tested the FUA optimisations - xfs_io and fio w/ RWF_DSYNC on XFS on iscsi - so I've never noticed that fstests doesn't actually exercise this syscall path directly.
>
> Granted, the problem was eventually discovered by fstests, but this also raised questions. The failing test was an AIO+DIO O_DSYNC test, but the trigger has been described as an "unusual event on a rarely tested configuration".
>
> That "unusual event" was a DIO completion being run from submission context because the IO completed before submission had finished. This is not actually unusual - it's how all AIO on synchronous IO devices completes. i.e. if you have a ram device or a (fake) pmem device, every single AIO will complete in this way, as the "IO reference" held by submit_bio() is released inside submit_bio() before it returns to the submission context. Hence the submission context always drops the last IO reference and completes the IO.
>
> Therefore running fstests on a ramdisk or (fake) pmem would have triggered this deadlock *instantly* on the first O_DSYNC AIO+DIO write that fstests issued. This implies that btrfs is rarely tested on fast synchronous storage devices, despite ramdisks being available on every test machine that can run fstests. To provide a contrast, the iomap infrastructure is regularly tested on such devices - both Darrick and I have (fake) pmem test setups and exercise synchronous completion code paths like this on a daily basis.
>
> Hence it looks to me like the answer to the "why wasn't this found earlier" question is a combination of multiple factors:
>
> 1. fstests has no specific non-AIO O_DSYNC DIO write unit tests, nor do the stress tests allow use of O_DSYNC or RWF_DSYNC.
>
> 2. No test coverage audit was performed prior to making a critical change to the btrfs IO path, so this specific lack of coverage was not noticed until now.
>
> 3. After the first revert of this functionality, post-mortem analysis either wasn't performed or didn't identify process and/or test coverage issues that allowed serious issues in the patch set to go undiscovered.
>
> 4. Tools and benchmarks that could have easily discovered the problem either weren't run or weren't configured to test and exercise all the IO path features the change affected.
>
> 5. btrfs is not regularly tested on a variety of storage with distinctly different IO path behaviours.
>
> > If we're going to mix interfaces then it should be blatantly obvious to developers that this is what's happening, so they find out during development, not after the patches have landed, and certainly not after they've made it out to users. Thanks,
>
> As the above indicates, this issue _should_ have been blatantly obvious months ago, and documentation would not change this fact. IOWs, even if the iomap requirement had been documented and followed, a locking bug in the btrfs implementation would still not have been discovered until now, because that's how long it took to actually exercise the buggy code path and expose it.
>
> So, yeah, the lack of documentation contributed to the bug being present. But taking 6 months to actually exercise the new code containing the bug is most definitely not an interface documentation problem, nor a problem that can be fixed by correcting the interface documentation....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

--
Filipe David Manana,

"Whether you think you can, or you think you can't — you're right."
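For reference, hitting the plain (non-AIO) O_DSYNC DIO write case that the analysis above says fstests never exercises needs nothing more elaborate than the sketch below - a rough standalone example, where the file argument, the 4096-byte alignment and the write size are just assumptions for illustration:

    /*
     * Rough sketch: one synchronous (non-AIO) O_DSYNC direct IO write.
     * The O_DSYNC side of completion (the implied fsync) has to finish
     * before pwrite() returns.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            const size_t len = 4096;        /* assumed DIO alignment/size */
            void *buf;
            int fd, err;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }

            fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            err = posix_memalign(&buf, len, len);
            if (err) {
                    fprintf(stderr, "posix_memalign: %s\n", strerror(err));
                    return 1;
            }
            memset(buf, 0xab, len);

            /* Plain, synchronous direct IO write - no AIO involved. */
            if (pwrite(fd, buf, len, 0) != (ssize_t)len) {
                    perror("pwrite");
                    return 1;
            }

            free(buf);
            close(fd);
            return 0;
    }

Run it against a file on the filesystem under test, ideally backed by a ramdisk or (fake) pmem given the point above about synchronous completion; xfs_io and fio can drive the same pattern with RWF_DSYNC if a proper tool is preferred.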