Re: [PATCH] xfs: redirty eof folio on truncate to avoid filemap flush

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 4 Nov 2022 09:25:41 +1100

On Thu, Nov 03, 2022 at 10:53:16AM -0400, Brian Foster wrote:
> On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> > On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> > > On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > ---
> > > > 
> > > > Here's a quick prototype of "option 3" described in my previous mail.
> > > > This has been spot tested and confirmed to prevent the original stale
> > > > data exposure problem. More thorough regression testing is still
> > > > required. Barring unforeseen issues with that, however, I think this is
> > > > tentatively my new preferred option. The primary reason for that is it
> > > > avoids looking at extent state and is more in line with what iomap based
> > > > zeroing should be doing more generically.
> > > > 
> > > > Because of that, I think this provides a bit more opportunity for follow
> > > > on fixes (there are other truncate/zeroing problems I've come across
> > > > during this investigation that still need fixing), cleanup and
> > > > consolidation of the zeroing code. For example, I think the trajectory
> > > > of this could look something like:
> > > > 
> > > > - Genericize a bit more to handle all truncates.
> > > > - Repurpose iomap_truncate_page() (currently only used by XFS) into a
> > > >   unique implementation from zero range that does explicit zeroing
> > > >   instead of relying on pagecache truncate.
> > > > - Refactor XFS ranged zeroing to an abstraction that uses a combination
> > > >   of iomap_zero_range() and the new iomap_truncate_page().
> > > > 
> > > 
> > > After playing with this and thinking a bit more about the above, I think
> > > I managed to come up with an iomap_truncate_page() prototype that DTRT
> > > based on this. Only spot tested so far, needs to pass iomap_flags to the
> > > other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
> > > other bugs/warts, etc. etc. This is just a quick prototype to
> > > demonstrate the idea, which is essentially to check dirty state along
> > > with extent state while under lock and transfer that state back to iomap
> > > so it can decide whether it can shortcut or forcibly perform the zero.
> > > 
> > > In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
> > > under lock and implies that the range is sub-block (single page).
> > > IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
> > > dirty, so perform the zero via buffered write regardless of extent
> > > state.
> > 
> > I'd much prefer we fix this in the iomap infrastructure - failing to
> > zero dirty data in memory over an unwritten extent isn't an XFS bug,
> > so we shouldn't be working around it in XFS like we did previously.
> > 
> 
> I agree, but that was the original goal from the start. It's easier said
> than done to just have iomap accurately write/skip the appropriate
> ranges..
> 
> > I don't think this should be call "IOMAP_TRUNC_PAGE", though,
> > because that indicates the caller context, not what we are asking
> > the internal iomap code to do. What we are really asking is for
> > iomap_zero_iter() to do is zero the page cache if it exists in
> > memory, otherwise ignore unwritten/hole pages.  Hence I think a name
> > like IOMAP_ZERO_PAGECACHE is more appropriate,
> > 
> 
> That was kind of the point for this prototype. The flag isn't just
> asking iomap to perform some generic write behavior. It also indicates
> this is a special snowflake mode with assumptions from the caller (i.e.,
> unflushed, sub-block/partial range) that facilitate forced zeroing
> internally.
> 
> I've since come to the conclusion that this approach is just premature.
> It really only does the right thing in this very particular case,
> otherwise there is potential for odd/unexpected behavior in sub-page
> blocksize scenarios. It could be made to work more appropriately
> eventually, but more thought and work is required and there's a jump in
> complexity that isn't required to fix the immediate performance problem
> and additional stale data exposure problems.
> 
> > > 
> > > Brian
> > > 
> > > --- 8< ---
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 91ee0b308e13..14a9734b2838 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > >  	loff_t written = 0;
> > >  
> > >  	/* already zeroed?  we're done. */
> > > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> > > +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
> > >  		return length;
> > 
> > Why even involve the filesystem in this? We can do this directly
> > in iomap_zero_iter() with:
> > 
> > 	if ((srcmap->type == IOMAP_HOLE)
> > 		return;
> > 	if (srcmap->type == IOMAP_UNWRITTEN) {
> > 		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
> > 			return;
> > 		if (!filemap_range_needs_writeback(inode->i_mapping,
> > 			    iomap->offset, iomap->offset + iomap->length))
> > 			return;
> > 	}
> > 
> 
> This reintroduces the same stale data exposure race fixed by the
> original patch. folio writeback can complete and convert an extent
> reported as unwritten such that zeroing will not occur when it should.

That sounds the exact data corruption bug that is being fixed in
this xfs/iomap series here:

https://lore.kernel.org/linux-xfs/20221101003412.3842572-1-david@xxxxxxxxxxxxx/

> The writeback check either needs to be in the fs code where it can
> prevent writeback completion from converting extents (under ilock), or
> earlier in iomap such that we're guaranteed to see either writeback
> state or the converted extent.

Or we catch stale iomaps once we've locked the page we are going to
modify and remap the extent before continuing the modification
operation.

> Given the above, I'm currently putting the prototype shown below through
> some regression testing. It lifts the writeback check into iomap and
> issues a flush and retry of the current iteration from iomap_zero_iter()
> when necessary.

Hmmm. It looks exactly like you are trying to reinvent the patchset
I've been working on for the past few weeks...

[...]

> I have some patches around that demonstrate such things, but given the
> current behavior of an unconditional flush and so far only seeing one
> user report from an oddball pattern, I don't see much need for
> additional hacks at the moment.
> 
> Thoughts? IMO, the iter retry and writeback check are both wonky and
> this is something that should first go into ->iomap_begin(), and only
> lift into iomap after a period of time. That's just my .02. I do plan to
> clean up with a retry helper and add comments and whatnot if this
> survives testing and/or any functional issues are worked out.

I think I've already solved all these problems with the above
IOMAP_F_STALE infrastructure in the above patchset. If that's the
case, then it seems like a no-brainer to me to base the fix for this
problem on the IOMAP_F_STALE capability we will soon have in this
code...

> index 91ee0b308e13..649d94ad3808 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +	struct inode *inode = iter->inode;
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
>  	loff_t written = 0;
> +	int status;
>  
>  	/* already zeroed?  we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> -		return length;
> +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> +		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
> +			return length;
> +
> +		status = filemap_write_and_wait_range(inode->i_mapping, pos,
> +						      pos + length - 1);
> +		if (status)
> +			return status;
> +		/* XXX: hacked up iter retry */
> +		iter->iomap.length = 0;
> +		return 1;

Yup, the IOMAP_F_STALE detection in iomap_write_begin() will detect
writeback races w/ unwritten extents that change the extent state
without needing to actually do physical writeback here.

IOWs, I think the fix here is not to skip IOMAP_UNWRITTEN extents up
front, but to allow zeroing to detect data in the page cache over
unwritten extents naturally.

i.e. if the iomap type is unwritten and we are zeroing, then
iomap_write_begin() drops the FGP_CREAT flag. This means the folio
lookup will only returns a folio if there was already a cached folio
in the page cache. If there is a cached folio, it will do iomap
validation and detect writeback races, marking the iomap stale. If
there is no folio or the folio returned is not dirty, then
iomap_zero_iter just moves on to checking the next folio in the
range...

> @@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
>  	if (ret <= 0)
>  		return ret;
>  
> +	if ((iter->flags & IOMAP_ZERO) &&
> +	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
> +					  iter->pos + iter->len - 1)) {
> +		iomap_flags |= IOMAP_F_DIRTY_CACHE;
> +	}

I really don't think we need this - I think we can do the page cache
state check and the iomap stale check directly in the
iomap_zero_iter() loop as per above...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx