On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote: > Hi Darrick, > > while reviewing your online repair series I've noticed that the new > xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting > for the buffer to go off a delwri list, and that reminded me off an > assert I saw during my allocator experiments, where > xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't > have XBF_DONE set because it comes from an ifree transaction (see > my current not fully thought out bandaid below). I'll come back to the bug later, because I know what it is and I just haven't had time to fix it yet. I'll address XBF_DONE first. XBF_DONE means the data in the buffer is valid. It's equivalent to the uptodate bit in a folio. It has no other meaning. > The way we currently set and check XBF_DONE seems a bit undefined. The > one clear use case is that read uses it to see if a buffer was read in. Yes. > But places that use buf_get and manually fill in data only use it in a > few cases. Yes. the caller of buf_get always needs to set XBF_DONE if it is initialising a new buffer ready for it to be written. It should be done before the caller drops the buffer lock so that no other lookup can see the buffer in the state of "contains valid data but does not have XBF_DONE set". Also, there are cases where we use buf_get but we don't care about the contents being initialised because we are invalidating the buffer and need the buffer+buf_log_item to log the invalidation to the journal. In these cases we just don't care that the contents are valid, because xfs_trans_binval() calls xfs_buf_stale() to invalidate the contents and that removes the XBF_DONE flag. We do this in places like inode chunk removal to invalidate the cluster buffers to ensure they are written back after then chunk has been freed. .... and this brings us to the bug that you mentioned about an ifree transaction leaving an inode cluster buffer in cache without XBF_DONE set.... The issue is xfs_inode_item_precommit() attaching inodes to the cluster buffer. In the old days before we delayed inode logging to the end of the ifree transaction, the order was: xfs_ifree xfs_difree(ip) xfs_imap_to_bp() xfs_trans_buf_read() xfs_trans_brelse() xfs_trans_log_inode(ip) xfs_ifree_cluster(ip) for each incore inode { set XFS_ISTALE } for each cluster buffer { xfs_trans_buf_get() xfs_trans_binval() } xfs_trans_commit() IOWs, the attachment of the inode to the cluster buffer in xfs_trans_log_inode() occurred before both the inode was marked XFS_ISTALE and the cluster buffer was marked XBF_STALE and XBF_DONE was removed from it. Hence the lookup in xfs_difree() always found a valid XBF_DONE buffer. With the fixes for AGF->AGI->inode cluster buffer locking order done a while back, we moved the processing that was done in xfs_trans_log_inode() to xfs_inode_item_precommit(), which is called from xfs_trans_commit(). This moved the xfs_imap_to_bp() call when logging th einode from before the cluster invalidation to after it. The result is that imap_to_bp() now finds the inode cluster buffer in memory (as it should), but it has been marked stale (correctly!) and xfs_trans_buf_read_map() freaks out over that (again, correctly!). The key to triggering this situation is that the inode cluster buffer needs to be written back between the unlink() syscall and the inactivation processing that frees the cluster buffer. Inode cluster buffer IO completion removes the inodes from the cluster buffer, so when they are next dirtied they need to be re-added. If this inode cluster buffer writeback coincides with the transaction removing of the last inode from an inode chunk and hence freeing the inode chunk, we end up with this stiuation occurring and assert failures in xfs_trans_read_buf_map(). So, like I said: I know what the bug is, I worked it out from the one time one of my test machines tripped over it about 4 weeks ago, but I just haven't had the time since then to work out a fix. I suspect that we can check XFS_ISTALE in xfs_inode_item_precommit() and do something different, but I'd much prefer that the inode still gets added to the inode cluster buffer and cleaned up with all the other XFS_ISTALE indoes on the cluster buffer at journal commit completion time. Maybe we can pass a new flag to xfs_imap_to_bp() to say "stale buffer ok here" or something similar, because I really want the general case of xfs_trans_buf_read_map() to fail loudly if a buffer without XBF_DONE is returned.... > Do we need to define clear semantics for it? Or maybe > replace with an XBF_READ_DONE flag for that main read use case and > then think what do do with the rest? To me, the semantics of XBF_DONE are pretty clear. Apart from fixing the bug you are seeing, I'm not sure that anything really needs to change.... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx