Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 4 Dec 2024 08:12:05 +1100

On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > can be instantiated over the range whilst we are running
> > xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> > are guaranteed that there will be no IO in progress or can be
> > started over the range we are removing.
> > 
> > Really, the issue is that writeback mappings have to be able to
> > handle the range being mapped suddenly appear to be beyond EOF.
> > This behaviour is a longstanding writeback constraint, and is what
> > iomap_writepage_handle_eof() is attempting to handle.
> > 
> > We handle this by only sampling i_size_read() whilst we have the
> > folio locked and can determine the action we should take with that
> > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > we've made the decision that the folio is within EOF and taken
> > action on it (i.e. moved the folio to writeback state), we cannot
> > then resample the inode size because a truncate may have started
> > and changed the inode size.
> > 
> > We have to complete the mapping of the folio to disk blocks - the
> > disk block mapping is guaranteed to be valid for the life of the IO
> > because the folio is locked and under writeback - and submit the IO
> > so that truncate_pagecache() will unblock and invalidate the folio
> > when the IO completes.
> > 
> > Hence writeback vs truncate serialisation is really dependent on
> > only sampling the inode size -once- whilst the dirty folio we are
> > writing back is locked.
> > 
> 
> Not sure I see how this is a serialization dependency given that
> writeback completion also samples i_size.

Ah, I didn't explain what I meant very clearly, did I?

What I mean was we can't sample i_size in the IO path without
specific checking/serialisation against truncate operations. And
that means once we have partially zeroed the contents of a EOF
straddling folio, we can't then sample the EOF again to determine
the length of valid data in the folio as this can race with truncate
and result in a different size for the data in the folio than we
prepared it for.

> But no matter, it seems a
> reasonable implementation to me to make the submission path consistent
> in handling eof.

Yes, the IO completion path does sample it again via xfs_new_eof().
However, as per above, it has specific checking for truncate down
races and handles them:

/*
 * If this I/O goes past the on-disk inode size update it unless it would
 * be past the current in-core inode size.
 */
static inline xfs_fsize_t
xfs_new_eof(struct xfs_inode *ip, xfs_fsize_t new_size)
{
        xfs_fsize_t i_size = i_size_read(VFS_I(ip));

>>>>    if (new_size > i_size || new_size < 0)
>>>>            new_size = i_size;
        return new_size > ip->i_disk_size ? new_size : 0;
}

If we have a truncate_setsize() called for a truncate down whilst
this IO is in progress, then xfs_new_eof() will see the new, smaller
inode isize. The clamp on new_size handles this situation, and we
then only triggers an update if the on-disk size is still smaller
than the new truncated size (i.e. the IO being completed is still
partially within the new EOF from the truncate down).

So I don't think there's an issue here at all at IO completion;
it handles truncate down races cleanly...

> I wonder if this could just use end_pos returned from
> iomap_writepage_handle_eof()?

Yeah, that was what I was thinking, but I haven't looked at the code
for long enough to have any real idea of whether that is sufficient
or not.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx