On Tue, Nov 28, 2023 at 08:58:31AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote:
> > Hi Darrick,
> > 
> > while reviewing your online repair series I've noticed that the new
> > xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
> > for the buffer to go off a delwri list, and that reminded me of an
> > assert I saw during my allocator experiments, where
> > xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
> > have XBF_DONE set because it comes from an ifree transaction (see
> > my current not fully thought out bandaid below).
> 
> LOL, guess what I was looking at yesterday too! :)
> 
> > The way we currently set and check XBF_DONE seems a bit undefined.  The
> > one clear use case is that read uses it to see if a buffer was read in.
> > But places that use buf_get and manually fill in data only use it in a
> > few cases.  Do we need to define clear semantics for it?  Or maybe
> > replace it with an XBF_READ_DONE flag for that main read use case and
> > then think about what to do with the rest?
> 
> I thought XBF_DONE meant "contents have been read in from disk and
> have passed/will pass verifiers"

Effectively.

> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index 8e886ecfd69a3b..575922c64d4d3a 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -253,7 +253,6 @@ xfs_trans_read_buf_map(
> >  		ASSERT(bp->b_transp == tp);
> >  		ASSERT(bp->b_log_item != NULL);
> >  		ASSERT(!bp->b_error);
> > -		ASSERT(bp->b_flags & XBF_DONE);
> 
> I don't think this is the right thing to do here -- if the buffer is
> attached to a transaction, it ought to be XBF_DONE.  I think every
> transaction that calls _get_buf and rewrites the buffer contents will
> set XBF_DONE via xfs_trans_dirty_buf, right?

It should, yes. Otherwise the initialisation data in the buffer has
not been logged and may result in writeback and/or recovery being
incorrectly ordered w.r.t.
the transaction that initialised the buffer contents.

> Hmm.  Maybe I'm wrong -- a transaction could bjoin a buffer and then
> call xfs_trans_read_buf_map before dirtying it.  That strikes me as a
> suspicious thing to do, though.

It also seems to me that it doesn't solve the problem of buffer
invalidation followed by a read operation in commit context.

Having thought a bit more on this (admittedly they are feverish,
covid-inspired thoughts), I suspect the real problem here is
requiring xfs_imap_to_bp() to ensure we can pin the cluster buffer
in memory whilst there are dirty inodes over it.

If we go back to the days of Irix and the early days of the Linux
port, we always pinned the inode cluster buffer in memory whilst we
had an inode active in cache via the inode cluster hash index. We
could do lookups directly on in-memory inode cluster buffers rather
than on inodes. Inodes held references on the cluster buffer, and
when the last reference to a cluster buffer went away, it was
removed from the inode cluster hash. i.e. we never had the inode
cluster buffer RMW problem where cached inodes got dirtied without a
cluster buffer already being in memory.

It was simple, and the only downside was that it didn't scale. Hence
we got rid of the inode cluster hash index and the pinning of inode
cluster buffers with the introduction of the RCU protected inode
cache based on radix trees. The radix trees were so much more
efficient than a fixed-size cluster hash that we simply did away
with the cluster hash, and got a memory footprint improvement for
read-mostly inode traversal workloads for free.

Perhaps it is time to revisit that ~15 year old architectural
choice? We've reinstated the pinning for dirty inodes, so perhaps we
should just do it unconditionally for all inodes now and reinstate
the direct inode -> cluster buffer relationship we once had.

This has other benefits as well.
We can "densify" the XFS clean inode cache by making VFS inodes
unconditionally I_DONT_CACHE, and simply rely on the XFS buffer
cache LRU to hold recently used inode buffers. It solves several
nasty corner cases around inode cluster buffer freeing and
XFS_ISTALE. It allows us to avoid cluster buffer lookups in the
inode logging path. It opens the gate to shared access to the buffer
for flushing dirty inodes to the cluster buffer without adding new
lifecycle problems. It allows significant efficiency optimisations
in managing inode items in the AIL, because lifecycle discrepancies
between cluster buffers and inodes go away. And so on.

So perhaps the best fix here is to reinstate the old behaviour of
inodes always pinning the inode cluster buffer in memory, and hence
elide the need for on-demand, log item/IO completion time item
tracking altogether....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx