On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote: > On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@xxxxxxxxxx wrote: > > From: Filipe Manana <fdmanana@xxxxxxxx> > > > > Some users recently reported that MariaDB was getting a read corruption > > when using io_uring on top of btrfs. This started to happen in 5.16, > > after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults > > during direct IO reads and writes"). That changed btrfs to use the new > > iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling > > iomap_dio_rw(). This was necessary to fix deadlocks when the iovector > > corresponds to a memory mapped file region. That type of scenario is > > exercised by test case generic/647 from fstests, and it also affected > > gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL. > > > > For this MariaDB scenario, we attempt to read 16K from file offset X > > using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each > > with a size of 4K, and what happens is the following: > > > > 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw(); > > > > 2) iomap creates a struct iomap_dio object, its reference count is > > initialized to 1 and its ->size field is initialized to 0; > > > > 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the > > You mean btrfs_dio_iomap_begin()? Yes, correct. > > > first 4K extent, and setups an iomap for this extent consisting of > > a single page; > > So we have IOCB_NOWAIT, which means btrfs_dio_iomap_begin() is being > passed IOMAP_NOWAIT and so knows it is being asked > to map an extent for an IO that is on a non-blocking path. > > btrfs_dio_iomap_begin() doesn't appear to support NOWAIT semantics > at all - it will block doing writeback IO, memory allocation, extent > locking, transaction reservations, extent allocation, etc.... We do have some checks for NOWAIT before getting into btrfs_dio_iomap_begin(), but they are only for the write path, and they are incomplete. Some are a bit tricky to deal with, but yes, there's several cases that are either missing or need to be improved. > > That, to me, looks like the root cause of the problem here - > btrfs_dio_iomap_begin() is not guaranteeing non-blocking atomic IO > semantics for IOCB_NOWAIT IO. > > In the case above, given that the extent lookup only found a 4kB > extent, we know that it doesn't span the entire requested IO range. > We also known that we cannot tell if we'll block on subsequent > mappings of the IO range, and hence no guarantee can be given that > IOCB_NOWAIT IO will not block when it is too late to back out with a > -EAGAIN error. > > Hence this whole set of problems could be avoided if > btrfs_dio_iomap_begin() returns -EAGAIN if it can't map the entire > IO into a single extent without blocking when IOMAP_NOWAIT is set? > That's exactly what XFS does in xfs_direct_iomap_write_begin(): > > /* > * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with > * a single map so that we avoid partial IO failures due to the rest of > * the I/O range not covered by this map triggering an EAGAIN condition > * when it is subsequently mapped and aborting the I/O. > */ > if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) { > error = -EAGAIN; > if (!imap_spans_range(&imap, offset_fsb, end_fsb)) > goto out_unlock; > } > > Basically, I'm thinking that IOMAP_NOWAIT and IOMAP_DIO_PARTIAL > should be exclusive functionality - if you are doing IOMAP_NOWAIT > then the entire IO must succeed without blocking, and if it doesn't > then we return -EAGAIN and the caller retries without IOCB_NOWAIT > set and so then we run with IOMAP_DIO_PARTIAL semantics in a thread > that can actually block.... Indeed, I had not considered that, that is simple and effective, plus it can be done exclusively in btrfs code, no need to change iomap. > > ..... > > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K > > and return 4K (the amount of io done) to iomap_dio_complete_work(); > > > > 12) iomap_dio_complete_work() calls the iocb completion callback, > > iocb->ki_complete() with a second argument value of 4K (total io > > done) and the iocb with the adjust ki_pos of X + 4K. This results > > in completing the read request for io_uring, leaving it with a > > result of 4K bytes read, and only the first page of the buffer > > filled in, while the remaining 3 pages, corresponding to the other > > 3 extents, were not filled; > > > > 13) For the application, the result is unexpected because if we ask > > to read N bytes, it expects to get N bytes read as long as those > > N bytes don't cross the EOF (i_size). > > Yeah, that's exactly the sort of problem we were having with XFS > with partial DIO completions due to needing multiple iomap iteration > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the > above range check and aborts before we start... Interesting. > > > > > So fix this by making __iomap_dio_rw() assign true to the boolean variable > > 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some > > progress for a read and we have not crossed the EOF boundary. Do this even > > if the read has IOCB_NOWAIT set, as it's the only way to avoid providing > > an unexpected result to an application. > > That's highly specific and ultimately will be fragile, IMO. I'd much > prefer that *_iomap_begin_write() implementations simply follow > IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple > mappings if punted to a context that can block... Yes, agreed. Thanks for your feedback Dave, it provided a really good insight into this problem (and others). > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx