Iomap buffered write short copy handling (with full folio uptodate)

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Fri, 21 Mar 2025 18:42:25 +1030

Hi,

I'm wondering if the current iomap short copy handler can handle the
following case correctly:

The fs block size is 4K, page size is 4K, the buffered write is into
file range [0, 4K), and the fs always does data COW.

The folio at file offset 0 is already uptodate, and the folio size is
also 4K.

- ops->iomap_begin() was called for the range [0, 4K) from iomap_iter()
  The fs reserved space for one block of data, plus some extra metadata
  space.

- copy_folio_from_iter_atomic() only copied 1K bytes

- iomap_write_end() returned true
  Since the folio is already uptodate, we can handle the short copy.
  The folio is marked dirty and uptodate.

- __iomap_put_folio() unlocked and put the folio

- Now a writeback was triggered for that folio at file offset 0
  The folio got properly written to disk.

  But remember we have only reserved one block of data space, and that
  reserved space is consumed by this writeback.

  Worse, the fs can even take a snapshot of the involved inode at this
  point, so that the on-disk copy of that short-written block (with
  only 1K of the new data) will not be freed.

- copy_folio_from_iter_atomic() copied the remaining 3K bytes
  All of this happens inside the do {} while () loop of
  iomap_write_iter(), thus no iomap_begin() callback can be triggered to
  allocate extra space.

- __iomap_put_folio() unlocked and put the folio at file offset 0 again.

- Now a writeback got started for that folio at file offset 0 again
  This requires another free data block from the fs.

In that case, iomap_begin() reserved only one block of data.
But in the end, we wrote 2 blocks of data due to the short copy.
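
To make the accounting concrete, here is a toy Python model of the
sequence above. It is only a sketch, not kernel code: the ToyCowFs
class and its methods are hypothetical names that mirror the steps
(one reservation at iomap_begin() time, one freshly allocated block per
COW writeback of the dirty folio), and it assumes nothing prevents
writeback from running between the two copies.

```python
from dataclasses import dataclass

BLOCK = 4096  # fs block size == page size == folio size in this scenario


@dataclass
class ToyCowFs:
    """Hypothetical model of data-space accounting on an always-COW fs."""
    reserved_blocks: int = 0   # data blocks reserved by iomap_begin()
    written_blocks: int = 0    # data blocks consumed by writeback
    dirty: bool = False        # state of the single folio at offset 0

    def iomap_begin(self, length: int) -> None:
        # Reserve data space once for the whole range (rounded up to blocks).
        self.reserved_blocks += -(-length // BLOCK)

    def copy(self, nbytes: int) -> None:
        # The folio is already uptodate, so even a short copy dirties it.
        if nbytes > 0:
            self.dirty = True

    def writeback(self) -> None:
        # Data COW: every writeback of the dirty folio needs a new block.
        if self.dirty:
            self.written_blocks += 1
            self.dirty = False


def simulate_short_copy() -> ToyCowFs:
    """1K short copy, writeback, remaining 3K, writeback again."""
    fs = ToyCowFs()
    fs.iomap_begin(BLOCK)
    fs.copy(1024)
    fs.writeback()      # consumes the only reserved block
    fs.copy(3072)       # same loop iteration range, no new iomap_begin()
    fs.writeback()      # needs a second block that was never reserved
    return fs


def simulate_full_copy() -> ToyCowFs:
    """The normal case: the whole 4K is copied before writeback."""
    fs = ToyCowFs()
    fs.iomap_begin(BLOCK)
    fs.copy(BLOCK)
    fs.writeback()
    return fs


if __name__ == "__main__":
    for name, fs in (("short", simulate_short_copy()),
                     ("full", simulate_full_copy())):
        print(name, "reserved:", fs.reserved_blocks,
              "written:", fs.written_blocks)
```

In this model the full-copy case stays within its reservation, while
the short-copy case ends with written_blocks larger than
reserved_blocks, which is exactly the mismatch I'm worried about.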

I'm wondering what the proper handling of a short copy during a
buffered write is.

Is there any special locking I missed that prevents the folio from
being written back halfway?
Or is it just too hard to trigger such a case in the real world?

Thanks,
Qu