On 6/8/22 12:02 PM, Matthew Wilcox wrote: > On Wed, Jun 08, 2022 at 10:17:33AM -0700, Stefan Roesch wrote: >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >> index b06a5c24a4db..f701dcb7c26a 100644 >> --- a/fs/iomap/buffered-io.c >> +++ b/fs/iomap/buffered-io.c >> @@ -829,7 +829,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i) >> length -= status; >> } while (iov_iter_count(i) && length); >> >> - return written ? written : status; >> + if (status == -EAGAIN) { >> + iov_iter_revert(i, written); >> + return -EAGAIN; >> + } >> + if (written) >> + return written; >> + return status; >> } > > I still don't understand how this can possibly work. Walk me through it. > > Let's imagine we have a file laid out such that extent 1 is bytes > 0-4095 of the file and extent 2 is extent 4096-16385 of the file. > We do a write of 5000 bytes starting at offset 4000 of the file. > > iomap_iter() tells us about the first extent and we write the first > 96 bytes of our data to the first extent, returning 96. iomap_iter() > tells us about the second extent, and we write the next 4000 bytes to > the second extent. Then we get a page fault and get to the -EAGAIN case. > We rewind the iter 4000 bytes. > We have two data structures, the iomap_iter and iov_iter. After the first 96 bytes, the iov_iter offset get updated in iomap_write_iter() and then the iomap_iter pos gets updated in iomap_iter()->iomap_iter_advance(). We then get the second extend from iomap_iter(). In iomap_write_iter() the first page is obtained and written successfully, then the second page is faulted. At this point the iov offset of the iov_iter has advanced. To reset it to the state when the function iomap_write_iter() was entered, the iov_iter is reset to iov_offset - written bytes. iomap_write_iter() is exited and returns -EAGAIN. As iomap_write_iter() returns an error, the iomap_iter pos is not updated in iomap_iter(). Only the number of bytes written in the write of the first extent from iomap_file_buffered_write() is returned from iomap_file_buffered_write(). In xfs_file_buffered_write we updated the iocb->ki_pos with the number of bytes written. In io-uring, the io_write() call receives the short write result. It copies the iov_iter struct into the work context for the io worker. The io_worker uses that information to complete the rest of the write. The above reset is required to keep the pos in iomap_iter and the offset in iov_iter in sync. Side Note: I had an earlier version of the patch that was changing the signature of the function iomap_write_iter(). It was returning a return code and changing the processed value of the iomap_iter (which then also changes the pos value of the iomap_iter). This version (version 7 of the patch) does not require to reset the offset of the iov_iter. It can update the pos in iomap_iter even when -EAGAIN is returned. > How do we not end up writing garbage when the kworker does the retry? > I'd understand if we rewound the iter all the way to the start. Or if > we didn't rewind the iter at all and were able to pick up partway through > the write. But rewinding to the start of the extent feels like it can't > possibly work.