On Wed, Nov 02, 2022 at 10:19:33AM -0700, Darrick J. Wong wrote: > On Tue, Nov 01, 2022 at 11:34:11AM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > Now that iomap supports a mechanism to validate cached iomaps for > > buffered write operations, hook it up to the XFS buffered write ops > > so that we can avoid data corruptions that result from stale cached > > iomaps. See: > > > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/ > > > > or the ->iomap_valid() introduction commit for exact details of the > > corruption vector. > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_bmap.c | 4 +-- > > fs/xfs/xfs_aops.c | 2 +- > > fs/xfs/xfs_iomap.c | 69 ++++++++++++++++++++++++++++++---------- > > fs/xfs/xfs_iomap.h | 4 +-- > > fs/xfs/xfs_pnfs.c | 5 +-- > > 5 files changed, 61 insertions(+), 23 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > > index 49d0d4ea63fc..db225130618c 100644 > > --- a/fs/xfs/libxfs/xfs_bmap.c > > +++ b/fs/xfs/libxfs/xfs_bmap.c > > @@ -4551,8 +4551,8 @@ xfs_bmapi_convert_delalloc( > > * the extent. Just return the real extent at this offset. > > */ > > if (!isnullstartblock(bma.got.br_startblock)) { > > - xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags); > > *seq = READ_ONCE(ifp->if_seq); > > + xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq); > > goto out_trans_cancel; > > } > > > > @@ -4599,8 +4599,8 @@ xfs_bmapi_convert_delalloc( > > XFS_STATS_INC(mp, xs_xstrat_quick); > > > > ASSERT(!isnullstartblock(bma.got.br_startblock)); > > - xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags); > > *seq = READ_ONCE(ifp->if_seq); > > + xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq); > > > > if (whichfork == XFS_COW_FORK) > > Hmm. @ifp here could be the cow fork, which means *seq will be from the > cow fork. This I think is going to cause problems in > xfs_buffered_write_iomap_valid... We can't get here from the buffered write path. This can only be reached by the writeback path via: iomap_do_writepage iomap_writepage_map xfs_map_blocks xfs_convert_blocks xfs_bmapi_convert_delalloc Hence the sequence numbers stored in iomap->private here are completely ignored by the writeback code. They still get stored in the struct xfs_writepage_ctx and checked by xfs_imap_valid(), so the changes here were really just making sure all the xfs_bmbt_to_iomap() callers stored the sequence number consistently. > > @@ -839,24 +848,25 @@ xfs_direct_write_iomap_begin( > > xfs_iunlock(ip, lockmode); > > > > error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, > > - flags, &imap); > > + flags, &imap, &seq); > > if (error) > > return error; > > > > trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap); > > return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, > > - iomap_flags | IOMAP_F_NEW); > > + iomap_flags | IOMAP_F_NEW, seq); > > > > out_found_cow: > > + seq = READ_ONCE(ip->i_df.if_seq); > > xfs_iunlock(ip, lockmode); > > length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount); > > trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap); > > if (imap.br_startblock != HOLESTARTBLOCK) { > > - error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0); > > + error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq); > > if (error) > > return error; > > } > > - return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED); > > + return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq); > > Here we've found a mapping from the COW fork and we're encoding it into > the struct iomap. Why is the sequence number from the *data* fork and > not the COW fork? Because my brain isn't big enough to hold all this code straight in my head. I found it all too easy to get confused by what iomap is passed where and how to tell them apart and I could not tell if there was an easy way to tell if the iomap passed to xfs_iomap_valid() needed to check against the data or cow fork. Hence I set it up such that the xfs_iomap_valid() callback only checks the data fork sequence so the only correct thing to set in the iomap is the data fork sequence and moved on to the next part of the problem... But, again, this information probably should be encoded into the sequence cookie we store. Which begs the question - if we are doing a COW operation, we have to check that neither the data fork (the srcmap) nor the COW fork (the iter->iomap) have changed, right? This is what the writeback code does (i.e. checks both data and COW fork sequence numbers), so perhaps that's also what we should be doing here? i.e. either the cookie for COW operations needs to contain both data+cow sequence numbers, and both need to match to proceed, or we have to validate both the srcmap and the iter->iomap with separate callouts once the folio is locked. Thoughts? > > @@ -1328,9 +1342,26 @@ xfs_buffered_write_iomap_end( > > return 0; > > } > > > > +/* > > + * Check that the iomap passed to us is still valid for the given offset and > > + * length. > > + */ > > +static bool > > +xfs_buffered_write_iomap_valid( > > + struct inode *inode, > > + const struct iomap *iomap) > > +{ > > + int seq = *((int *)&iomap->private); > > + > > + if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq)) > > Why is it ok to sample the data fork's sequence number even if the > mapping came from the COW fork? That doesn't make sense to me, > conceptually. I definitely think it's buggy that the revalidation > might not sample from the same sequence counter as the original > measurement. Yup, see above. > > const struct iomap_ops xfs_read_iomap_ops = { > > @@ -1438,7 +1471,7 @@ xfs_seek_iomap_begin( > > end_fsb = min(end_fsb, data_fsb); > > xfs_trim_extent(&cmap, offset_fsb, end_fsb); > > error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, > > - IOMAP_F_SHARED); > > + IOMAP_F_SHARED, READ_ONCE(ip->i_cowfp->if_seq)); > > Here we explicitly sample from the cow fork but the comparison is done > against the data fork. That /probably/ doesn't matter for reads since > you're not proposing that we revalidate on those. Should we? As I said elsewhere, I haven't even looked at the read path. The iomap code is now complex enough that I have trouble following it and keeping all the corner cases in my head. This is the down side of having people smarter than you write the code you need to debug when shit inevitably goes wrong. > What > happens if userspace hands us a large read() from an unwritten extent > and the same dirty/writeback/reclaim sequence happens to the last folio > in that read() range before read_iter actually gets there? No idea - shit may well go wrong there in the same way, but I've had my hands full just fixing the write path. I'm *really* tired of fighting with the iomap code right now... > > @@ -189,7 +190,7 @@ xfs_fs_map_blocks( > > xfs_iunlock(ip, lock_flags); > > > > error = xfs_iomap_write_direct(ip, offset_fsb, > > - end_fsb - offset_fsb, 0, &imap); > > + end_fsb - offset_fsb, 0, &imap, &seq); > > I got a compiler warning about seq not being set in the case where we > don't call xfs_iomap_write_direct. Yup, gcc 11.2 doesn't warn about it. Now to find out why my build system is using gcc 11.2 when I upgraded it months ago to use gcc-12 because of exactly these sorts of issues.... :( -Dave. -- Dave Chinner david@xxxxxxxxxxxxx