Re: Iomap buffered write short copy handling (with full folio uptodate)

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 21 Mar 2025 19:57:01 +1100

On Fri, Mar 21, 2025 at 06:42:25PM +1030, Qu Wenruo wrote:
> Hi,
> 
> I'm wondering if the current iomap short copy handler can handle the
> following case correctly:
> 
> The fs block size is 4K, page size is 4K, the buffered write is into
> file range [0, 4K), the fs is always doing data COW.
> 
> The folio at file offset 0 is already uptodate, and the folio size is
> also 4K.
> 
> - ops->iomap_begin() got called for the range [0, 4K) from iomap_iter()
>   The fs reserved space of one block of data, and some extra metadata
>   space.
> 
> - copy_folio_from_iter_atomic() only copied 1K bytes
> 
> - iomap_write_end() returned true
>   Since the folio is already uptodate, we can handle the short copy.
>   The folio is marked dirty and uptodate.
> 
> - __iomap_put_folio() unlocked and put the folio
> 
> - Now a writeback was triggered for that folio at file offset 0
>   The folio got properly written to disk.
> 
>   But remember we have only reserved one block of data space, and that
>   reserved space is consumed by this writeback.

This bumps the internal inode mapping generation number....

>   What's worse is, the fs can even do a snapshot of that involved inode,
>   so that the current copy of that 1K short-written block will not be
>   freed.
> 
> - copy_folio_from_iter_atomic() copied the remaining 3K bytes

No, we don't get that far. iomap_begin_write() calls
__iomap_get_folio() to get and lock the folio again, then calls
folio_ops->iomap_valid() to check that the iomap is still valid.

In the above case, the cookie in the iomap (the mapping generation
number at the time the iomap was created by ->iomap_begin) won't
match the current inode mapping generation number as it was bumped
on writeback.

Hence the iomap is marked IOMAP_F_STALE, the current write is
aborted before it starts, then iomap_write_iter() sees IOMAP_F_STALE
and restarts the write again.

We then get a new mapping from ops->iomap_begin() with a new 1 block
reservation for the remaining 3kB of data to be copied into that
block.

i.e. iomaps are cached information, and we have to validate that the
mapping has not changed once we have all the objects we are about to
modify locked and ready for modification.

>   All these happens inside the do {} while () loop of
>   iomap_write_iter(), thus no iomap_begin() callback can be triggered to
>   allocate extra space.
> 
> - __iomap_put_folio() unlocked and put the folio 0 again.
> 
> - Now a writeback got started for that folio at file offset 0 again
>   This requires another free data block from the fs.
> 
> In that case, iomap_begin() only reserved one block of data.
> But in the end, we wrote 2 blocks of data due to short copy.
> 
> I'm wondering what's the proper handling of short copy during buffered
> write.

> Is there any special locking I missed preventing the folio from being
> written back halfway?

Not locking, just state validation and IOMAP_F_STALE. i.e.
filesystems that use delalloc or cow absolutely need to implement
folio_ops->iomap_valid() to detect stale iomaps....

> Or is it just too hard to trigger such case in the real world?

Triggered it, we certainly did. It caused data corruption and took
quite some time to triage and understand.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx