On Sat, Aug 26, 2023 at 10:54:11PM +0800, cheng.lin130@xxxxxxxxxx wrote:
> > On Fri, Aug 25, 2023 at 04:32:22PM +0800, cheng.lin130@xxxxxxxxxx wrote:
> > > > On Thu, Aug 24, 2023 at 03:43:52PM +0800, cheng.lin130@xxxxxxxxxx wrote:
> > > >> From: Cheng Lin <cheng.lin130@xxxxxxxxxx>
> > > >> An dir nlinks overflow which down form 0 to 0xffffffff, cause the
> > > >> directory to become unusable until the next xfs_repair run.
> > > >>
> > > >> Introduce protection for drop nlink to reduce the impact of this.
> > > >> And produce a warning for directory nlink error during remove.
> > > >>
> > > >> Signed-off-by: Cheng Lin <cheng.lin130@xxxxxxxxxx>
> > > >> ---
> > > >>  fs/xfs/xfs_inode.c | 16 +++++++++++++++-
> > > >>  1 file changed, 15 insertions(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > >> index 9e62cc5..536dbe4 100644
> > > >> --- a/fs/xfs/xfs_inode.c
> > > >> +++ b/fs/xfs/xfs_inode.c
> > > >> @@ -919,6 +919,15 @@ STATIC int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag,
> > > >>  	xfs_trans_t	*tp,
> > > >>  	xfs_inode_t	*ip)
> > > >>  {
> > > >> +	xfs_mount_t	*mp;
> > > >> +
> > > >> +	if (VFS_I(ip)->i_nlink == 0) {
> > > >> +		mp = ip->i_mount;
> > > >> +		xfs_warn(mp, "%s: Deleting inode %llu with no links.",
> > > >> +			 __func__, ip->i_ino);
> > > >> +		return 0;
> > > >> +	}
> > > >> +
> > > >>  	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
> > > >>
> > > >>  	drop_nlink(VFS_I(ip));
> > > > I'm not sure how nlink would ever get to 0xFFFFFFFF since the VFS won't
> > > > let a link count exceed s_max_links, and XFS sets that to 0x7FFFFFFF.
> > > > Unless, of course, you did that outside of Linux.
> > > In VFS drop_nlink() only produce a warning, when (inode->i_nlink == 0),
> > > not prevent its self-reduce(inode->__i_nlink--), cause it underflow
> > > from 0 to 0xffffffff.
> > It is interesting that vfs_unlink doesn't check the link counts of
> > either the parent or the child.  Maybe it should, since the VFS
> > link/mkdir/rename functions check.
> > I wonder if this is a historical leftover from the days when the VFS
> > did no checking at all?
> VFS produce a warning means it has discovered an abnormal situation.
> I don't know why it just produce a warning. But, in other fs like
> fuse/nfs/overlayfs/ext4, there is further protection for this situation.

Well, the question is how the link count got corrupted in the first
place....

> > > In the old kernel version, this situation was
> > > encountered, but I don't know how it happened. It was already a scene
> > > with directory errors: "Too many links".

How do you overflow the directory link count in XFS? You can't fit 2^31
unique names in the directory data segment - the directory will ENOSPC
at 32GB of name data, and that typically occurs with at most 300-500
million dirents (depending on name lengths) in the directory.

IOWs, normal operation shouldn't be able to overflow the directory link
count at all, and so underruns shouldn't occur, either.

> > > kernel: WARNING: CPU: 12 PID: 12928 at fs/inode.c:286 drop_nlink+0x3e/0x50
> > > kernel: CPU: 12 PID: 12928 Comm: gbased Tainted: G W OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1

So this is a RHEL 7 kernel, and it is tainted with unknown out of tree
(3rd party) kernel modules. Hence it could be memory corruption from
whatever drivers are loaded.

It's also old enough that it is possible this is a v4 filesystem, and
if so, it could be that storage media corruption occurred and it wasn't
detected.
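Going back to the drop_nlink() point above: the VFS helper really does
warn and then decrement anyway. Roughly, from fs/inode.c (the exact
code varies a little between kernel versions):

	void drop_nlink(struct inode *inode)
	{
		WARN_ON(inode->i_nlink == 0);
		inode->__i_nlink--;	/* 0 wraps to 0xffffffff here */
		if (!inode->i_nlink)
			atomic_long_inc(&inode->i_sb->s_remove_count);
	}

i.e. the WARN_ON only reports the problem; it does not stop the
underflow, so preventing it is left entirely to the caller.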
> > > kernel: Hardware name: HPE ProLiant BL460c Gen10/ProLiant BL460c Gen10, BIOS I41 01/23/2021
> > > kernel: Call Trace:
> > > kernel: [<ffffffff816c5fce>] dump_stack+0x19/0x1b
> > > kernel: [<ffffffff8108dfa8>] __warn+0xd8/0x100
> > > kernel: [<ffffffff8108e0ed>] warn_slowpath_null+0x1d/0x20
> > > kernel: [<ffffffff8122cdfe>] drop_nlink+0x3e/0x50
> > > kernel: [<ffffffffc03cdc78>] xfs_droplink+0x28/0x60 [xfs]
> > > kernel: [<ffffffffc03cf87a>] xfs_remove+0x2aa/0x320 [xfs]
> > > kernel: [<ffffffffc03c9f7a>] xfs_vn_unlink+0x5a/0xa0 [xfs]
> > > kernel: [<ffffffff8121f19c>] vfs_rmdir+0xdc/0x150
> > > kernel: [<ffffffff81221e41>] do_rmdir+0x1f1/0x220
> > > kernel: [<ffffffff81223046>] SyS_rmdir+0x16/0x20
> > > kernel: [<ffffffff816d86d5>] system_call_fastpath+0x1c/0x21

Without actually knowing the root cause, and directory link count
overflows not being possible in normal operation, I'm not sure that we
should be jumping to conclusions that the directory link count code in
the upstream kernel is actually broken and needs fixing.

> > > > That said, why wouldn't you /pin/ the link count at -1U instead of
> > > > allowing it to overflow to zero?
> > > > Could you please take a look at this patch that's waiting in my
> > > > submission queue?
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=inode-repair-improvements&id=05f5a82efa6395c92038e18e008aaf7154238f27

This is protecting against regular files with too many hardlinks; a
directory will never have the link count set to XFS_NLINK_PINNED
because it just can't have that many entries in it....

> > > I think the XFS_NLINK_PINNEED(~0U) can be used prevent Overflow in inc_nlink().
> > > Is it better to compare i_nlink with (0U) in drop_nlink() to prevent Underflow?
> > > (like this patch does, do not make i_nlink underflow from 0 to 0xffffffff)
> > Is it a problem if a directory i_nlink underflows to XFS_NLINK_PINNED?
> > At that point the directory will never be freed, and xfs_repair/scrub
> > get to figure out the correct link count.

I think that's wrong. The directory inode gets unlinked when the link
count goes to zero, before the underflow occurs and can be detected.
i.e. the warning occurs when the link count goes from 0 to -1, and that
is after the directory has been unlinked on the transition from 1 to 0.

If more entries are removed from the directory at this point, the
NLINK_PINNED detection then kicks in and we don't drop the nlink of the
directory any further. But by then the damage has already been done -
the directory is on the unlinked list, and now it has a non-zero nlink
count, which means the VFS will not drop the inode and it remains
cached. i.e. we have a corrupt runtime state where an inode is on the
unlinked list yet isn't scheduled for inactivation/freeing when the
last reference to it goes away. Indeed, inactivation won't trigger
unlinked list removal, either, because the link count is not zero.

I suspect at this point we have multiple on-disk corruptions in the
filesystem. The parent directory points to an inode on the unlinked
list, and if we crash at this point we have an inode on the unlinked
list that will be skipped by the recovery code because i_nlink is not
zero (same iput_final/drop_inode problem). This ends up with a corrupt
in-memory unlinked list on the next mount (i.e. an inode on the list
that isn't in memory...), and nothing good happens from that...
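For illustration only, here is a rough sketch of the "detect it and
stop" variant being debated - it is not the posted patch and not
upstream code. The helper names are the usual XFS ones but signatures
differ across kernel versions, and the -EFSCORRUPTED return plus the
xfs_force_shutdown() call are assumptions reflecting the suggestion
further down to stop the filesystem on detection:

	STATIC int
	xfs_droplink(
		struct xfs_trans	*tp,
		struct xfs_inode	*ip)
	{
		/*
		 * An already-zero link count means something has gone badly
		 * wrong; refuse to let drop_nlink() wrap it to 0xffffffff
		 * and treat it as corruption instead.
		 */
		if (VFS_I(ip)->i_nlink == 0) {
			xfs_warn(ip->i_mount,
				"%s: inode %llu link count already zero.",
				__func__, ip->i_ino);
			xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
			return -EFSCORRUPTED;
		}

		xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
		drop_nlink(VFS_I(ip));
		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);

		if (VFS_I(ip)->i_nlink)
			return 0;

		/* link count hit zero: move the inode to the AGI unlinked list */
		return xfs_iunlink(tp, ip);
	}

As noted above, though, by the time such a check fires the directory
has already been through the 1 -> 0 transition, so detection at this
point is damage limitation, not a fix for the underlying corruption.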
> > --D
> Yes, with i_nlink underflows to XFS_NLINK_PINNED, the directory will become
> unavailable until be repaired. But the running service on this directory will be
> failed. If i_nlink is protected from underflow, the use of the directory can continue,
> and the continuity of services is guaranteed. The incorrect count also will be fixed
> at next repair.

I think that given what an underflow represents - some kind of
corruption - and the fact that it can propagate in nasty ways if we
allow operation to continue, we should be shutting down the filesystem
if we detect an underflow on a directory inode. This will force repair
of the filesystem as soon as possible.

IOWs, we're already in a bad situation when this warning is issued for
directories, and so not dropping the nlink count after it has already
underflowed doesn't matter one bit - we should be shutting down the
filesystem, not trying to continue onwards as if nothing happened...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx