Re: [PATCH 2/2] xfs: use iomap_valid method to detect stale cached iomaps

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 27 Sep 2022 21:54:27 -0700

On Fri, Sep 23, 2022 at 10:04:03AM +1000, Dave Chinner wrote:
> On Wed, Sep 21, 2022 at 08:44:01PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 21, 2022 at 06:29:59PM +1000, Dave Chinner wrote:
> > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > @@ -1160,13 +1181,20 @@ xfs_buffered_write_iomap_end(
> > >  
> > >  	/*
> > >  	 * Trim delalloc blocks if they were allocated by this write and we
> > > -	 * didn't manage to write the whole range.
> > > +	 * didn't manage to write the whole range. If the iomap was marked stale
> > > +	 * because it is no longer valid, we are going to remap this range
> > > +	 * immediately, so don't punch it out.
> > >  	 *
> > > -	 * We don't need to care about racing delalloc as we hold i_mutex
> > > +	 * XXX (dgc): This next comment and assumption is totally bogus because
> > > +	 * iomap_page_mkwrite() runs through here and it doesn't hold the
> > > +	 * i_rwsem. Hence this whole error handling path may be badly broken.
> > 
> > That probably needs fixing, though I'll break that out as a separate
> > reply to the cover letter.
> 
> I'll drop it for the moment - I wrote that note when I first noticed
> the problem as a "reminder to self" to mention the problem in the
> cover letter because....
> 
> > 
> > > +	 *
> > > +	 * We don't need to care about racing delalloc as we hold i_rwsem
> > >  	 * across the reserve/allocate/unreserve calls. If there are delalloc
> > >  	 * blocks in the range, they are ours.
> > >  	 */
> > > -	if ((iomap->flags & IOMAP_F_NEW) && start_fsb < end_fsb) {
> > > +	if (((iomap->flags & (IOMAP_F_NEW | IOMAP_F_STALE)) == IOMAP_F_NEW) &&
> > > +	    start_fsb < end_fsb) {
> > >  		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
> > >  					 XFS_FSB_TO_B(mp, end_fsb) - 1);
> 
> .... I really don't like this "fix". If the next mapping (the
> revalidated range) doesn't exactly fill the remainder of the
> original delalloc mapping within EOF, we end up with delalloc blocks
> within EOF that have no data in the page cache over them. i.e. this
> relies on blind luck to avoid unflushable delalloc extents and is a
> serious landmine to be leaving behind.

I'd kinda wondered over the years why not just leave pages in place and
in whatever state they were before, but never really wanted to dig too
deep into that.  I suppose I will when the v2 patchset arrives.
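
For anyone following along, the flag test in the quoted hunk can be read in isolation. Below is a minimal userspace sketch of it, not the kernel code: the SKETCH_F_* values and sketch_should_punch() are hypothetical names made up for illustration, and the real IOMAP_F_* values may differ. Masking with both flags and comparing against NEW alone is true only when NEW is set and STALE is clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define SKETCH_F_NEW	(1u << 0)
#define SKETCH_F_STALE	(1u << 1)

/*
 * Mirrors the shape of the quoted condition: punch only when this write
 * created the mapping (NEW), the mapping has not been marked STALE, and
 * there is an unwritten remainder [start_fsb, end_fsb).
 */
static bool sketch_should_punch(uint16_t flags, uint64_t start_fsb,
				uint64_t end_fsb)
{
	return (flags & (SKETCH_F_NEW | SKETCH_F_STALE)) == SKETCH_F_NEW &&
	       start_fsb < end_fsb;
}
```

So a NEW|STALE mapping skips the punch entirely, which is exactly the case Dave is worried about above.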

> The fact we want buffered writes to move to shared i_rwsem operation
> also means that we have no guarantee that nobody else has added data
> into the page cache over this delalloc range. Hence punching out the
> page cache and then the delalloc blocks is exactly the wrong thing
> to be doing.
> 
> Further, racing mappings over this delalloc range mean that those
> other contexts will also be trying to zero ranges of partial pages
> because iomap_block_needs_zeroing() returns true for IOMAP_DELALLOC
> mappings regardless of IOMAP_F_NEW.
> 
> Indeed, XFS is only using IOMAP_F_NEW on the initial delalloc
> mapping to perform the above "do we need to punch out the unused
> range" detection in xfs_buffered_write_iomap_end(). i.e. it's a flag
> that says "we allocated this delalloc range", but it in no way
> indicates "we are the only context that has written data into this
> delalloc range".
> 
> Hence I suspect that the first thing we need to do here is get rid
> of this use of IOMAP_F_NEW and the punching out of delalloc range
> on write error. I think what we need to do here is walk the page
> cache over the range of the remaining delalloc region and for every
> hole that we find in the page cache, we punch only that range out.

That would make more sense; I bet we'd have tripped over this as soon as
we shifted buffered writes to IOLOCK_SHARED and failed a write().
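
To make the "punch only the holes" idea concrete, here is a rough userspace sketch under loud assumptions: the page cache is modelled as a plain bool array indexed by page, and sketch_punch_uncached() and its callback type are hypothetical names, not iomap or XFS functions. It ignores locking, folio sizes and dirty state entirely; it only shows the walk that finds each maximal run of absent pages and punches that run and nothing else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback standing in for "punch delalloc over [start, end)". */
typedef void (*sketch_punch_fn)(size_t start, size_t end, void *ctx);

/*
 * Walk page indexes [start, end) and call punch() once for each maximal
 * run of pages that are absent from the cache. cached[i] is true when
 * page i holds data. Returns the number of ranges punched.
 */
static size_t sketch_punch_uncached(const bool *cached, size_t start,
				    size_t end, sketch_punch_fn punch,
				    void *ctx)
{
	size_t nr = 0;
	size_t i = start;

	assert(start <= end);
	while (i < end) {
		size_t hole_start;

		if (cached[i]) {
			/* Data present: leave the blocks under it alone. */
			i++;
			continue;
		}
		hole_start = i;
		while (i < end && !cached[i])
			i++;
		punch(hole_start, i, ctx);
		nr++;
	}
	return nr;
}

/* Example callback: accumulate how many pages were punched. */
static void sketch_count_pages(size_t start, size_t end, void *ctx)
{
	*(size_t *)ctx += end - start;
}
```

With pages 1, 2 and 5 cached in an 8-page range, the walk punches [0,1), [3,5) and [6,8) and leaves the blocks backing cached data in place.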

> We probably need to do this holding the mapping->invalidate_lock
> exclusively to ensure the page cache contents do not change while
> we are doing this walk - this will at least cause other contexts
> that have the delalloc range mapped to block during page cache
> insertion. This will then cause the ->iomap_valid() check they
> run once the folio is inserted and locked to detect that the iomap
> they hold is now invalid and needs remapping...

<nod>
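
The revalidation pattern described above (sample a sequence counter when the mapping is built, compare it again once the folio is locked) can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the patch: struct sketch_fork, struct sketch_map and the sketch_* helpers are hypothetical, and the kernel version uses READ_ONCE() on if_seq rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode fork carrying a change counter. */
struct sketch_fork {
	atomic_uint seq;	/* bumped on every extent map change */
};

/* Hypothetical stand-in for a cached mapping. */
struct sketch_map {
	unsigned int seq;	/* counter value sampled at mapping time */
};

/* Record the fork's current sequence number in the cached mapping. */
static void sketch_map_sample(struct sketch_map *map,
			      struct sketch_fork *fork)
{
	map->seq = atomic_load_explicit(&fork->seq, memory_order_acquire);
}

/* Any extent change bumps the counter, invalidating older mappings. */
static void sketch_fork_modify(struct sketch_fork *fork)
{
	atomic_fetch_add_explicit(&fork->seq, 1, memory_order_release);
}

/* The mapping is usable only if nothing changed since it was sampled. */
static bool sketch_map_valid(const struct sketch_map *map,
			     struct sketch_fork *fork)
{
	return map->seq ==
	       atomic_load_explicit(&fork->seq, memory_order_acquire);
}
```

A caller that finds the mapping invalid would drop it and remap the range before touching the folio, which is the behaviour the stale detection is meant to trigger.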

> This would avoid the need for IOMAP_F_STALE and IOMAP_F_NEW to be
> propagated into the new contexts - only iomap_iter() would need to
> handle advancing STALE maps with 0 bytes processed specially....

Ooh nice.

> > > @@ -1182,9 +1210,26 @@ xfs_buffered_write_iomap_end(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Check that the iomap passed to us is still valid for the given offset and
> > > + * length.
> > > + */
> > > +static bool
> > > +xfs_buffered_write_iomap_valid(
> > > +	struct inode		*inode,
> > > +	const struct iomap	*iomap)
> > > +{
> > > +	int			seq = *((int *)&iomap->private);
> > > +
> > > +	if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq))
> > > +		return false;
> > > +	return true;
> > > +}
> > 
> > Wheee, thanks for tackling this one. :)
> 
> I think this one might have a long way to run yet.... :/

It's gonna be a fun time backporting this all to 4.14. ;)

Btw, can you share the reproducer?

--D

> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx