On Wed, Sep 02, 2020 at 08:42:25AM -0400, Josef Bacik wrote: > On 9/2/20 8:20 AM, Dave Chinner wrote: > > On Wed, Sep 02, 2020 at 12:44:14PM +0100, Matthew Wilcox wrote: > > > On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote: > > > > Put simply: converting a filesystem to use iomap is not a "change > > > > the filesystem interfacing code and it will work" modification. We > > > > ask that filesystems are modified to conform to the iomap IO > > > > exclusion model; adding special cases for every potential > > > > locking and mapping quirk every different filesystem has is part of > > > > what turned the old direct IO code into an unmaintainable nightmare. > > > > > > > > > That's fine, but this is kind of a bad way to find > > > > > out. We really shouldn't have generic helper's that have different generic > > > > > locking rules based on which file system uses them. > > > > > > > > We certainly can change the rules for new infrastructure. Indeed, we > > > > had to change the rules to support DAX. The whole point of the > > > > iomap infrastructure was that it enabled us to use code that already > > > > worked for DAX (the XFS code) in multiple filesystems. And as people > > > > have realised that the DIO via iomap is much faster than the old DIO > > > > code and is a much more efficient way of doing large buffered IO, > > > > other filesystems have started to use it. > > > > > > > > However.... > > > > > > > > > Because then we end up > > > > > with situations like this, where suddenly we're having to come up with some > > > > > weird solution because the generic thing only works for a subset of file > > > > > systems. Thanks, > > > > > > > > .... we've always said "you need to change the filesystem code to > > > > use iomap". This is simply a reflection on the fact that iomap has > > > > different rules and constraints to the old code and so it's not a > > > > direct plug in replacement. There are no short cuts here... > > > > > > Can you point me (and I suspect Josef!) towards the documentation of the > > > locking model? I was hoping to find Documentation/filesystems/iomap.rst > > > but all the 'iomap' strings in Documentation/ refer to pci_iomap and > > > similar, except for this in the DAX documentation: > > > > There's no locking model documentation because there is no locking > > in the iomap direct IO code. The filesystem defines and does all the > > locking, so there's pretty much nothing to document for iomap. > > > > IOWs, the only thing iomap_dio_rw requires is that the IO completion > > paths do not take same locks that the IO submission path > > requires. And that's because: > > > > /* > > * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO > > * is being issued as AIO or not. [...] > > > > So you obviously can't sit waiting for dio completion in > > iomap_dio_rw() while holding the submission lock if completion > > requires the submission lock to make progress. > > > > FWIW, iomap_dio_rw() originally required the inode_lock() to be held > > and contained a lockdep assert to verify this, but.... > > > > commit 3ad99bec6e82e32fa9faf2f84e74b134586b46f7 > > Author: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx> > > Date: Sat Nov 30 09:59:25 2019 -0600 > > > > iomap: remove lockdep_assert_held() > > Filesystems such as btrfs can perform direct I/O without holding the > > inode->i_rwsem in some of the cases like writing within i_size. So, > > remove the check for lockdep_assert_held() in iomap_dio_rw(). > > Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx> > > Signed-off-by: David Sterba <dsterba@xxxxxxxx> > > > > ... btrfs has special corner cases for direct IO locking and hence > > we removed the lockdep assert.... > > > > IOWs, iomap_dio_rw() really does not care what strategy filesystems > > use to serialise DIO against other operations. Filesystems can use > > whatever IO serialisation mechanism they want (mutex, rwsem, range > > locks, etc) as long as they obey the one simple requirement: do not > > take the DIO submission lock in the DIO completion path. > > > > Goldwyn has been working on these patches for a long time, and is actually > familiar with this code, and he missed that these two interfaces are being > mixed. This is a problem that I want to solve. He didn't notice it in any > of his testing, which IIRC was like 6 months to get this stuff actually into > the btrfs tree. And that "been working on it for 6 months" is what really worries me the most. Hence I've done some post-mortem analysis of the "inode_lock" deadlock situation. I decided to do this after I realised this same patch set was merged last cycle and reverted late in the cycle because of problems it was found to have with page invalidation when mixed IO types were used. I did a bit of looking around yesterday afternoon, both in the btrfs code to understand the iomap_dio_rw changes and at our test coverage for DIO functionality. The more I dug, the more I'm finding that the most important question we need to answer is not "why wasn't this iomap locking requirement documented", the question we should be trying to answer is this: Why wasn't this *obvious* deadlock detected months ago? First of all, the inode-lock() complaints are largely a red herring as the typical btrfs DIO write path through the patchset does not take the inode lock and hence it is actually "safe" to take the inode lock in the completion path in the common case. i.e. for DIO writes wholly inside EOF, submission drops the inode_lock() before calling iomap_dio_rw() and so there is no inode_lock() deadlock present. From that view point, it looks like there's only a specific corner case where this deadlock might occur and that would explain why it hasn't been discovered until now. However. The new btrfs dio write path always does down_read(BTRFS_I()->dio_sem), even when the inode lock has not been taken. That's important because btrfs_sync_file() always does a down_write(&BTRFS_I(inode)->dio_sem) call and will deadlock iin IO completion if the IO submission code is holding the dio_sem. So regardless of the inode_lock(), non-AIO O_DSYNC DIO writes through iomap_dio_rw() should deadlock immediately on the first IO with this locking arrangement. It will deadlock on either the inode_lock or the dio_sem, depending on whether the ODSYNC DIO write is wholly within EOF or not, but the deadlock is guaranteed to occur. Hence we can completely ignore the "inode_lock vs fsync" side show, because other btrfs internal locks will trigger the same issue. If this is correct, then how did this "should happen instantly" bug go undiscovered for months? Well.... It appears that fstests has no coverage of non-AIO O_DSYNC DIO writes. Tools like fsx and fstress don't have O_SYNC, O_DSYNC or RWF_DSYNC modes and O_SYNC and O_DSYNC is not used by any of the common DIO test programs. xfs_io can open files O_SYNC and there's a test that uses xfs_io's RWF_DSYNC capability, but they all use buffered IO and so none of tests that open or write data synchronously use direct IO. The only conclusion I can make from thsi is that the case that should deadlock instantly isn't actually covered by fstests at all. This conclusion only stands up this O_DSYNC code path was only "tested" for feature coverage with fstests. However, it does imply that no pre-implementation audit was done to determine if fstests actually covered all the functionality that needed to be tested here.... I tend to use xfs_io and fio for DIO feature correctness testing long before I run fstests on new code. That's how I developed and tested the FUA optimisations - xfs_io and fio w/ RWF_DSYNC on XFS on iscsi - so I've never noticed that fstests doesn't actually exercise this syscall path directly. Granted, the problem was eventually discovered by fstests, but this also raised questions. The failing test was an AIO+DIO O_DSYNC test, but the trigger has been described as a "unusual event on a rarely tested configuration". That "unusual event" was an DIO completion being run from submission context because the IO completed before the submission had been finish. This is not actually unusual - it's how all AIO on synchronous IO devices complete. i.e. if you have a ram device or a (fake) pmem device, every single AIO will complete in this way as the "IO reference" held by submit_bio() is completed inside submit_bio() before it returns to the submission context. Hence the submission context always drops the last IO reference and completes the IO. Therefore running fstests on a ramdisk or (fake) pmem would have triggered this deadlock *instantly* on the first O_DSYNC AIO+DIO write that fstests issued. This implies that btrfs is rarely tested on fast synchronous storage devices despite ramdisks being available on every test machine that can run fstests. To provide a contrast, the iomap infrastructure is regularly tested on such devices - both Darrick and I have both have (fake) pmem test setups and exercise synchronous completion code paths like this on a daily basis. Hence it looks to me like the answer to the "why wasn't this found earlier" question is a combination of multiple factors: 1. fstests has no specific non-AIO O_DSYNC DIO write unit tests, nor do the stress tests allow use O_DSYNC or RWF_DSYNC. 2. No test coverage audit was performed prior to making a critical change to the btrfs IO path so this specific lack of coverage was not noticed until now. 3. after the first revert of this functionality, post-mortem analysis either wasn't performed or didn't identify process and/or test coverage issues that allowed serious issues in the patch set to go undiscovered. 4. tools and benchmarks that could have easily discovered the problem either weren't run or they weren't configured to test and exercise all the IO path features the change affected. 5. btrfs is not regularly tested on a variety of storage that have distinctly different IO path behaviours. > If we're going to mix interfaces then it should be > blatantly obvious to developers that's what's happening so the find out > during development, not after the patches have landed, and certainly not > after they've made it out to users. Thanks, As the above indicates, this issue _should_ have been blatantly obvious months ago and documentation would not change this fact. IOWs, even if the iomap requirement was documented and followed, a locking bug in the btrfs implementation would still not have been discovered until now because that's how long it took to actually exercise the buggy code path and expose it. So, yeah, the lack of documentation contributed to the bug being present. But taking 6 months to actually exercise the new code containing the bug is most definitely not an interface documentation problem, nor a problem that can be fixed by correcting the interface documentation.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx