On Tue, Jun 13, 2023 at 11:04:34AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> When inactivating an unlinked inode and its attrs, if the xlog is shut
> down either during or just after the process of recursively deleting
> attribute nodes/leaves in xfs_attr3_root_inactive(), the log will
> record some buffer cancel items but not the corresponding extent
> entries and inode updates; this is incomplete and inconsistent.

What is incomplete and inconsistent? The directory block on disk will
still contain the previously written directory block contents, so it
will still be able to be read just fine and do the right thing even if
it was cancelled during recovery...

(Yes, I know the answer, but the commit message needs to spell out how
the metadata becomes inconsistent so people trying to decide if this
needs backporting can understand the scope of the problem fully.)

> Because the inactivation process is not completed and the unlinked
> inode is still in the agi_unlinked table, it will continue to be
> inactivated after replaying the log on the next mount,

Yes, that's normal behaviour.

> but the attr node/leaf blocks' creation records from before the
> cancel items cannot be replayed while the inode update is, so we can
> read corrupted data from the cancelled blocks.

The dabtree iteration is depth first, so it should always cancel leaf
blocks before it cancels node blocks, right? Hence the dabtree itself
should never be inconsistent - it should be removing the index in the
node block in the same transaction that invalidates the leaf block.

IOWs, if we recover the buffer cancel for a leaf block, it should also
have been removed from the node block that references it in the same
transaction.

.....

Oh, that's the root cause of the problem, isn't it? It's following a
stale pointer in the dabtree, yes?
> XFS (pmem0): Metadata corruption detected at
> xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
> XFS (pmem0): Unmount and run xfs_repair
> XFS (pmem0): First 128 bytes of corrupted metadata buffer:
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
>
> In order to fix the issue, we need to remove the extent entries and
> update the inode and attr btree atomically when staling attr
> node/leaf blocks. And note that we may also need to log and update
> the parent attr node entry when removing a child node or leaf attr
> block.

That's what xfs_itruncate_extents(ATTR_FORK) does. Doing it dablk by
dablk is rather inefficient - the code as it stands is an attempt to
minimise the overhead of removing a large dabtree.

So it looks to me like the problem is that when the child is
invalidated, the parent is left with a stale pointer to the
invalidated child in the dabtree. That's the problem you are trying
to fix by unmapping the dablk before invalidation?

But then you don't remove the reference in the parent block, and
instead rely on the side-effect of a "read into a hole is not
corruption" flag to allow recovery to follow pointers to known stale,
freed dablks? i.e. we trade a "detected an invalidated block" error
with a potential intentional use-after-free situation?
So, yeah, I think we need to rewrite the dablkno the parent holds for
the child at the same time we invalidate the child. If we rewrite the
child dabno to a directory block offset we know will always be invalid
(i.e. land in a hole beyond the directory size limit), then
xfs_dabuf_map() will always find a hole and return a null buffer
pointer. We never have to follow a pointer to a potentially freed
block...

We already re-read the parent buffer to get the next child and have to
juggle buffer locks, etc, to deal with that, so all we actually need
to do is move this code up into the transaction that invalidates the
child and log the rewritten dablkno in the parent block. I think we
could simplify the code quite significantly by doing this.

>  * Invalidate any incore buffers associated with this remote attribute value
> @@ -139,7 +140,8 @@ xfs_attr3_node_inactive(
>  	xfs_daddr_t parent_blkno, child_blkno;
>  	struct xfs_buf *child_bp;
>  	struct xfs_da3_icnode_hdr ichdr;
> -	int error, i;
> +	int error, i, done;
> +	xfs_filblks_t count = mp->m_attr_geo->fsbcount;
>  
>  	/*
>  	 * Since this code is recursive (gasp!) we must protect ourselves.
> @@ -172,10 +174,13 @@ xfs_attr3_node_inactive(
>  	 * traversal of the tree so we may deal with many blocks
>  	 * before we come back to this one.
>  	 */
> -	error = xfs_da3_node_read(*trans, dp, child_fsb, &child_bp,
> -			XFS_ATTR_FORK);
> +	error = __xfs_da3_node_read(*trans, dp, child_fsb,
> +			XFS_DABUF_MAP_HOLE_OK, &child_bp,
> +			XFS_ATTR_FORK);
>  	if (error)
>  		return error;
> +	if (!child_bp)
> +		goto next_entry;

Hmmmm. If the kernel is downgraded after a crash with this
invalidation in progress, the older kernel that doesn't have this
check will crash, right?

I suspect that anything we do here will have that same problem, but
this is the sort of thing the commit message needs to point out
because it is an important consideration in determining if this is
the best fix or not...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx