Re: Iomap buffered write short copy handling (with full folio uptodate)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





在 2025/3/21 19:27, Dave Chinner 写道:
On Fri, Mar 21, 2025 at 06:42:25PM +1030, Qu Wenruo wrote:
Hi,

I'm wondering if the current iomap short copy handler can handle the
following case correctly:

The fs block size is 4K, page size is 4K, the buffered write is into
file range [0, 4K), the fs is always doing data COW.

The folio at file offset 0 is already uptodate, and the folio size is
also 4K.

- ops->iomap_begin() got called for the range [0, 4K) from iomap_iter()
   The fs reserved space of one block of data, and some extra metadata
   space.

- copy_folio_from_iter_atomic() only copied 1K bytes

- iomap_write_end() returned true
   Since the folio is already uptodate, we can handle the short copy.
   The folio is marked dirty and uptodate.

- __iomap_put_folio() unlocked and put the folio

- Now a writeback was triggered for that folio at file offset 0
   The folio got properly written to disk.

   But remember we have only reserved one block of data space, and that
   reserved space is consumed by this writeback.

This bumps the internal inode mapping generation number....

   What's worse is, the fs can even do a snapshot of that involved inode,
   so that the current copy of that 1K short-written block will not be
   freed.

- copy_folio_from_iter_atomic() copied the remaining 3K bytes

No, we don't get that far. iomap_begin_write() calls
__iomap_get_folio() to get and lock the folio again, then calls
folio_ops->iomap_valid() to check that the iomap is still valid.

In the above case, the cookie in the iomap (the mapping generation
number at the time the iomap was created by ->iomap_begin) won't
match the current inode mapping generation number as it was bumped
on writeback.

Hence the iomap is marked IOMAP_F_STALE, the current write is
aborted before it starts, then iomap_write_iter() sees IOMAP_F_STALE
and restarts the write again.

We then get a new mapping from ops->iomap_begin() with a new 1 block
reservation for the remaining 3kB of data to be copied into that
block.

i.e. iomaps are cached information, and we have to validate that the
mapping has not changed once we have all the objects we are about to
modify locked and ready for modification.

Thanks a lot!

Didn't notice the iomap_valid() handling is even involved.


   All these happens inside the do {} while () loop of
   iomap_write_iter(), thus no iomap_begin() callback can be triggered to
   allocate extra space.

- __iomap_put_folio() unlocked and put the folio 0 again.

- Now a writeback got started for that folio at file offset 0 again
   This requires another free data block from the fs.

In that case, iomap_begin() only reserved one block of data.
But in the end, we wrote 2 blocks of data due to short copy.

I'm wondering what's the proper handling of short copy during buffered
write.

Is there any special locking I missed preventing the folio from being
written back halfway?

Not locking, just state validation and IOMAP_F_STALE. i.e.
filesystems that use delalloc or cow absolutely need to implement
folio_ops->iomap_valid() to detect stale iomaps....

Got it, this also means a COW fs must implement that callback if using
iomap.

And this solution also has a good optimization where if no writeback
happened between the lock and unlock, we can just reuse the last
reserved space without extra allocation.

Let me explore if we can do a similar solution in btrfs, other than the
current always-re-allocate behavior.

Thanks,
Qu


Or is it just too hard to trigger such case in the real world?

Triggered it, we certainly did. It caused data corruption and took
quite some time to triage and understand.

-Dave.






[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux