On Wed, Sep 21, 2022 at 08:44:01PM -0700, Darrick J. Wong wrote: > On Wed, Sep 21, 2022 at 06:29:59PM +1000, Dave Chinner wrote: > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > @@ -1160,13 +1181,20 @@ xfs_buffered_write_iomap_end( > > > > /* > > * Trim delalloc blocks if they were allocated by this write and we > > - * didn't manage to write the whole range. > > + * didn't manage to write the whole range. If the iomap was marked stale > > + * because it is no longer valid, we are going to remap this range > > + * immediately, so don't punch it out. > > * > > - * We don't need to care about racing delalloc as we hold i_mutex > > + * XXX (dgc): This next comment and assumption is totally bogus because > > + * iomap_page_mkwrite() runs through here and it doesn't hold the > > + * i_rwsem. Hence this whole error handling path may be badly broken. > > That probably needs fixing, though I'll break that out as a separate > reply to the cover letter. I'll drop it for the moment - I wrote that note when I first noticed the problem as a "reminder to self" to mention it the problem in the cover letter because.... > > > + * > > + * We don't need to care about racing delalloc as we hold i_rwsem > > * across the reserve/allocate/unreserve calls. If there are delalloc > > * blocks in the range, they are ours. > > */ > > - if ((iomap->flags & IOMAP_F_NEW) && start_fsb < end_fsb) { > > + if (((iomap->flags & (IOMAP_F_NEW | IOMAP_F_STALE)) == IOMAP_F_NEW) && > > + start_fsb < end_fsb) { > > truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb), > > XFS_FSB_TO_B(mp, end_fsb) - 1); .... I really don't like this "fix". If the next mapping (the revalidated range) doesn't exactly fill the remainder of the original delalloc mapping within EOF, we end up with delalloc blocks within EOF that have no data in the page cache over them. i.e. this relies on blind luck to avoid unflushable delalloc extents and is a serious landmine to be leaving behind. The fact we want buffered writes to move to shared i_rwsem operation also means that we have no guarantee that nobody else has added data into the page cache over this delalloc range. Hence punching out the page cache and then the delalloc blocks is exactly the wrong thing to be doing. Further, racing mappings over this delalloc range mean that those other contexts will also be trying to zero ranges of partial pages because iomap_block_needs_zeroing() returns true for IOMAP_DELALLOC mappings regardless of IOMAP_F_NEW. Indeed, XFS is only using IOMAP_F_NEW on the initial delalloc mapping to perform the above "do we need to punch out the unused range" detection in xfs_buffered_write_iomap_end(). i.e. it's a flag that says "we allocated this delalloc range", but it in no way indicates "we are the only context that has written data into this delalloc range". Hence I suspect that the first thing we need to do here is get rid of this use of IOMAP_F_NEW and the punching out of delalloc range on write error. I think what we need to do here is walk the page cache over the range of the remaining delalloc region and for every hole that we find in the page cache, we punch only that range out. We probably need to do this holding the mapping->invalidate_lock exclusively to ensure the page cache contents do not change while we are doing this walk - this will at least cause other contexts that have the delalloc range mapped to block during page cache insertion. This will then cause the the ->iomap_valid() check they run once the folio is inserted and locked to detect that the iomap they hold is now invalid an needs remapping... This would avoid the need for IOMAP_F_STALE and IOMAP_F_NEW to be propagated into the new contexts - only iomap_iter() would need to handle advancing STALE maps with 0 bytes processed specially.... > > @@ -1182,9 +1210,26 @@ xfs_buffered_write_iomap_end( > > return 0; > > } > > > > +/* > > + * Check that the iomap passed to us is still valid for the given offset and > > + * length. > > + */ > > +static bool > > +xfs_buffered_write_iomap_valid( > > + struct inode *inode, > > + const struct iomap *iomap) > > +{ > > + int seq = *((int *)&iomap->private); > > + > > + if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq)) > > + return false; > > + return true; > > +} > > Wheee, thanks for tackling this one. :) I think this one might have a long way to run yet.... :/ -Dave. -- Dave Chinner david@xxxxxxxxxxxxx