On Fri, Apr 21, 2023 at 11:31:40AM +0800, Guo Xuenan wrote:
> I observed the following evidence of a leak when xfs_inactive failed.
> Especially in debug mode, when xfs_attr_inactive fails, the current
> error path handling rudely clears the inode attr fork. If the inode
> is then recycled an assertion will occur; if not, it may also lead
> to a memory leak.
>
> xfs_attr_inactive is supposed to clean up the attribute fork when
> the inode is being freed, but it removes the in-memory attribute
> fork even when truncating the attribute fork extents fails, so some
> attr data may be left in memory and never released. By avoiding that
> cleanup and re-adding the inode to the gclist, this type of problem
> can be solved to some extent.
>
> The following script reliably reproduces the bug described above.
> ```
> #!/bin/bash
> DISK=vdb
> MP=/mnt/$DISK
> DEV=/dev/$DISK
> nfiles=10
> xattr_val="this is xattr value."
>
> ## prepare and mount
> while true
> do
> 	pidof fsstress | xargs kill -9
> 	umount $MP
> 	df | grep $MP || break
> 	sleep 2
> done
>
> mkdir -p ${MP} && mkfs.xfs -f $DEV && mount $DEV $MP
> echo 0 > /sys/fs/xfs/$DISK/errortag/bmapifmt
>
> ## create files, setxattr, remove files
> cd $MP; touch $(seq 1 $nfiles); cd $OLDPWD
> for n in `seq 1 $nfiles`; do
> 	for j in `seq 1 20`; do
> 		setfattr -n user.${j} -v "$xattr_val" $MP/$n
> 	done
> done

OK, so 20 xattrs of about 40 bytes on disk each in each file. That's
enough to put them in extent format.

> ## inject fault & trigger fails
> fsstress -d $MP -z -f bulkstat=200 -S c -l 1000 -p 8 &
> /usr/bin/rm $MP/*

Queue everything up for inodegc.

> echo 3 > /sys/fs/xfs/$DISK/errortag/bmapifmt

And inject random xfs_bmapi_read() errors into inodegc processing.

> ```
>
> Assertion in the kernel log as follows:
>
> XFS (vdb): Mounting V5 Filesystem bd1b6c38-599a-43b3-8194-a584bebec4ca
> XFS (vdb): Ending clean mount
> xfs filesystem being mounted at /mnt/vdb supports timestamps until 2038 (0x7fffffff)
> XFS (vdb): Injecting error (false) at file fs/xfs/libxfs/xfs_bmap.c, line 3887, on filesystem "vdb"
> XFS: Assertion failed: ip->i_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 2277

OK, so the error injection has caused xfs_bmapi_read() to return
-EFSCORRUPTED while truncating the attr fork. xfs_inactive() has
completely dropped that error on the ground and then failed to free
the inode before marking it valid for reclaim. The inode freeing code
has then asserted because the inode is not clean - we're freeing an
inode that still references filesystem blocks.

The ASSERT() is certainly valid and should have triggered, but I
don't think the behaviour it is warning about needs fixing. I'll get
to that later.

> diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
> index 5db87b34fb6e..a7379beb355a 100644
> --- a/fs/xfs/xfs_attr_inactive.c
> +++ b/fs/xfs/xfs_attr_inactive.c
> @@ -336,21 +336,25 @@ xfs_attr_inactive(
>  	ASSERT(! XFS_NOT_DQATTACHED(mp, dp));
>  
>  	xfs_ilock(dp, lock_mode);
> -	if (!xfs_inode_has_attr_fork(dp))
> -		goto out_destroy_fork;
> +	if (!xfs_inode_has_attr_fork(dp)) {
> +		xfs_ifork_zap_attr(dp);
> +		goto out_unlock;
> +	}
>  	xfs_iunlock(dp, lock_mode);

Why do we even need the lock here for this check? The inode cannot be
found by a VFS lookup, and we are in XFS_INACTIVATING state, which
means internal inode lookups will skip it too. So nothing other than
the inodegc code can be referencing the inode attr fork here. So
what's the lock for?

I'd just replace this with:

	if (!xfs_inode_has_attr_fork(dp))
		return 0;

>  	lock_mode = 0;
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_attrinval, 0, 0, 0, &trans);
>  	if (error)
> -		goto out_destroy_fork;
> +		goto out_unlock;

We have nothing locked here, so:

	if (error)
		return error;

>  	lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(dp, lock_mode);
>  
> -	if (!xfs_inode_has_attr_fork(dp))
> +	if (!xfs_inode_has_attr_fork(dp)) {
> +		xfs_ifork_zap_attr(dp);
>  		goto out_cancel;
> +	}
>  
>  	/*
>  	 * No need to make quota reservations here. We expect to release some
> @@ -383,9 +387,7 @@ xfs_attr_inactive(
>  
>  out_cancel:
>  	xfs_trans_cancel(trans);
> -out_destroy_fork:
> -	/* kill the in-core attr fork before we drop the inode lock */
> -	xfs_ifork_zap_attr(dp);
> +out_unlock:
>  	if (lock_mode)
>  		xfs_iunlock(dp, lock_mode);
>  	return error;

Hmmmm. No, I don't think this is right - the existing code is doing
the right thing. We just got an -EFSCORRUPTED error from
xfs_itruncate_extents(), and we are in the process of freeing the
inode. The inode is unlinked, unreferenced and corrupt, so we need to
leak the blocks we cannot access in the attr fork to be able to
successfully free the inode and continue on without shutting down the
filesystem.

In this situation we need to zap the in-memory attr fork so that the
memory associated with the extents we just decided to leak is freed
correctly rather than leaked as well. Essentially, any error in this
path should be zapping the attr fork.
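i.e. with the lockless early returns above, the tail of the function
can keep the existing out_destroy_fork behaviour. An untested sketch
of the shape I mean - the labels and helpers are all from the code
quoted above, but the xfs_itruncate_extents() call and its arguments
are from memory rather than from the quoted patch, so check them
against the actual source:

	error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
	if (error)
		goto out_cancel;
	...
out_cancel:
	xfs_trans_cancel(trans);
out_destroy_fork:
	/*
	 * We may have just decided to leak attr fork blocks we cannot
	 * access. Kill the in-core attr fork before we drop the inode
	 * lock so the in-memory extent state is freed here rather
	 * than leaked as well.
	 */
	xfs_ifork_zap_attr(dp);
	xfs_iunlock(dp, XFS_ILOCK_EXCL);
	return error;

Once the unlocked paths return directly, we always hold
XFS_ILOCK_EXCL when we get here, so the conditional unlock goes away
too.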
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 351849fc18ff..4afa7fec4a2f 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -48,6 +48,7 @@ static int xfs_icwalk(struct xfs_mount *mp,
>  		enum xfs_icwalk_goal goal, struct xfs_icwalk *icw);
>  static int xfs_icwalk_ag(struct xfs_perag *pag,
>  		enum xfs_icwalk_goal goal, struct xfs_icwalk *icw);
> +static void xfs_inodegc_queue(struct xfs_inode *ip);
>  
>  /*
>   * Private inode cache walk flags for struct xfs_icwalk. Must not
> @@ -1843,7 +1844,10 @@ xfs_inodegc_inactivate(
>  {
>  	trace_xfs_inode_inactivating(ip);
>  	xfs_inactive(ip);
> -	xfs_inodegc_set_reclaimable(ip);
> +	if (VFS_I(ip)->i_mode != 0)
> +		xfs_inodegc_queue(ip);
> +	else
> +		xfs_inodegc_set_reclaimable(ip);
>  }

I don't think this works the way you think it does.

Firstly, xfs_inodegc_queue() may try to flush the workqueue (i.e.
wait for pending work to be completed), so queuing inodegc work from
an inodegc workqueue worker could deadlock the inodegc workqueue.

Secondly, inodes with i_mode != 0 that aren't being unlinked are also
pushed through this path to remove blocks beyond EOF, etc. This
change means those inodes get permanently stuck on the inodegc queue,
as they always get re-added to the queue and are never marked
reclaimable. See xfs_inode_mark_reclaimable() and
xfs_inode_needs_inactive().

Thirdly, pushing an inode that failed inactivation due to corruption
through xfs_inactive() again will just result in the same corruption
error occurring again. Hence those inodes also get stuck forever in
the inodegc loop.

Ah. Yeah. Except for your test case. Error injection is
random/transient, and you gave it a 33% chance of injecting an error.
Hence the inode will continue looping through inactivation until it
doesn't trigger the xfs_bmapi_read() error injection. Because you
also changed the code not to trash the attr fork on truncation error,
the extent list remains present until every extent gets freed
successfully. Hence ip->i_nblocks gets reduced to zero before we try
to free the inode, and the specific failure your error injection test
tripped over goes away.

Now I understand - this fixes a transient corruption error caused by
error injection, but in doing so it will break production systems
that encounter persistent corruption during inode inactivation.

---

Yes, I agree that the way xfs_inactive() and inodegc handle errors
and cleanup needs improvement, but we've known that for a while now.
It doesn't change the fact that we currently need to be able to leak
resources we can't access so that we can continue to operate.

It's fine for ASSERTs to fire on debug kernels in these situations -
as developers we need to understand when these situations occur - but
that doesn't mean the behaviour they are warning about needs to be
fixed. The assert is just telling us that we are leaking stuff; it
doesn't know why.

We have been waiting on fine-grained health information for inodes to
be able to handle situations like this more gracefully. That code is
being merged in 6.4, and it means we will know the inode is corrupt.
Hence we can tailor the debug assert to only fire if the inode has
not encountered corruption, do smarter things in inodegc/reclaim when
dealing with corrupt inodes, etc.
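Something like this is the shape I have in mind. Sketch only -
xfs_inode_mark_sick(), XFS_SICK_INO_BMBTA and xfs_inode_is_healthy()
are my recollection of the health tracking API, so check the actual
names against for-next before relying on them:

	/* in xfs_inactive(), when attr fork removal fails */
	error = xfs_attr_inactive(ip);
	if (error)
		xfs_inode_mark_sick(ip, XFS_SICK_INO_BMBTA);

	/* and in the inode freeing path on debug kernels */
	ASSERT(ip->i_nblocks == 0 || !xfs_inode_is_healthy(ip));

i.e. the assert still fires and catches code bugs when we leak blocks
from an inode we have no reason to believe is corrupt, but it stays
silent when we have already recorded that we are intentionally
leaking blocks we cannot access.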
However, that doesn't mean the inactivation behaviour that triggered
the assert is wrong. That assert often picks up code bugs, but a
corruption error while freeing an inode is one of the situations
where the "leak the resources we can't access so we can free the
inode" behaviour is intentional. That's what we want production
machines to do, whilst on debug machines we want to understand why
the situation happened and determine if there's something better we
can do with it.

Hence I'd suggest looking at the inode health code in the current
linux-xfs/for-next tree, checking that the error injection you are
using marks the inode as unhealthy, and working from there...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx