On Fri, Apr 21, 2023 at 05:49:32PM +1000, Dave Chinner wrote:
> On Fri, Apr 21, 2023 at 11:31:40AM +0800, Guo Xuenan wrote:
> > I observed the following evidence of a leak when xfs_inactive failed.
> > Especially in debug mode, when xfs_attr_inactive fails, the current
> > error path handling rudely clears the inode attr fork; if the inode
> > is then recycled the assertion will occur, and if it is not, this
> > may also lead to a memory leak.
> >
> > xfs_attr_inactive is supposed to clean up the attribute fork when
> > the inode is being freed, but it removes the in-memory attribute
> > fork even when truncating the attribute fork extents fails, so some
> > attr data may be left in memory and never released. By avoiding this
> > kind of cleanup and re-adding the inode to the gclist, this type of
> > problem can be solved to some extent.
> >
> > The following script reliably replays the bug described above.
> > ```
> > #!/bin/bash
> > DISK=vdb
> > MP=/mnt/$DISK
> > DEV=/dev/$DISK
> > nfiles=10
> > xattr_val="this is xattr value."
> > ## prepare and mount
> > while true
> > do
> > 	pidof fsstress | xargs kill -9
> > 	umount $MP
> > 	df | grep $MP || break
> > 	sleep 2
> > done
> >
> > mkdir -p ${MP} && mkfs.xfs -f $DEV && mount $DEV $MP
> > echo 0 > /sys/fs/xfs/$DISK/errortag/bmapifmt
> >
> > ## create files, setxattr, remove files
> > cd $MP; touch $(seq 1 $nfiles); cd $OLDPWD
> > for n in `seq 1 $nfiles`; do
> > 	for j in `seq 1 20`; do
> > 		setfattr -n user.${j} -v "$xattr_val" $MP/$n
> > 	done
> > done
>
> OK, so 20 xattrs of about 40 bytes on disk each in each file. That's
> enough to put them in extent format.
>
> > ## inject fault & trigger fails
> > fsstress -d $MP -z -f bulkstat=200 -S c -l 1000 -p 8 &
> > /usr/bin/rm $MP/*
>
> Queue everything up for inodegc.
>
> > echo 3 > /sys/fs/xfs/$DISK/errortag/bmapifmt
>
> And inject random xfs_bmapi_read() errors into inodegc processing.
>
> > ```
> >
> > Assertion in the kernel log as follows:
> >
> > XFS (vdb): Mounting V5 Filesystem bd1b6c38-599a-43b3-8194-a584bebec4ca
> > XFS (vdb): Ending clean mount
> > xfs filesystem being mounted at /mnt/vdb supports timestamps until 2038 (0x7fffffff)
> > XFS (vdb): Injecting error (false) at file fs/xfs/libxfs/xfs_bmap.c, line 3887, on filesystem "vdb"
> > XFS: Assertion failed: ip->i_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 2277
>
> Ok, so the error injection has caused xfs_bmapi_read() to return
> -EFSCORRUPTED when truncating the attr fork, which xfs_inactive()
> has completely dropped on the ground and then failed to free the
> inode before marking it valid for reclaim.
>
> Then the inode freeing code has asserted that the inode is not clean
> and we're freeing an inode that still references filesystem blocks.
> The ASSERT() is certainly valid and should have triggered, but I
> don't think that it needs fixing. I'll get to that later.
>
> > diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
> > index 5db87b34fb6e..a7379beb355a 100644
> > --- a/fs/xfs/xfs_attr_inactive.c
> > +++ b/fs/xfs/xfs_attr_inactive.c
> > @@ -336,21 +336,25 @@ xfs_attr_inactive(
> >  	ASSERT(! XFS_NOT_DQATTACHED(mp, dp));
> >  
> >  	xfs_ilock(dp, lock_mode);
> > -	if (!xfs_inode_has_attr_fork(dp))
> > -		goto out_destroy_fork;
> > +	if (!xfs_inode_has_attr_fork(dp)) {
> > +		xfs_ifork_zap_attr(dp);
> > +		goto out_unlock;
> > +	}
> >  	xfs_iunlock(dp, lock_mode);
>
> Why do we even need the lock here for this check? The inode cannot
> be found by a VFS lookup, and we are in XFS_INACTIVATING state, which
> means internal inode lookups will skip it too. So nothing other than
> the inodegc code can be referencing the inode attr fork here. So
> what's the lock for?
>
> I'd just replace this with:
>
> 	if (!xfs_inode_has_attr_fork(dp))
> 		return 0;
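
(Assembling that suggestion with the quoted diff context above, the
top of the function would end up looking something like this -- an
untested sketch, not a posted patch:)

```
int
xfs_attr_inactive(
	struct xfs_inode	*dp)
{
	struct xfs_mount	*mp = dp->i_mount;
	struct xfs_trans	*trans;
	int			error;

	ASSERT(!XFS_NOT_DQATTACHED(mp, dp));

	/*
	 * No ILOCK needed for this check: the inode is in
	 * XFS_INACTIVATING state, so nothing else can find it and
	 * race with us on the attr fork here.
	 */
	if (!xfs_inode_has_attr_fork(dp))
		return 0;

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_attrinval, 0, 0, 0,
			&trans);
	if (error)
		return error;

	xfs_ilock(dp, XFS_ILOCK_EXCL);

	/* ... truncate and invalidate the attr fork as before ... */

	xfs_iunlock(dp, XFS_ILOCK_EXCL);
	return error;
}
```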
> >  	lock_mode = 0;
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_attrinval, 0, 0, 0, &trans);
> >  	if (error)
> > -		goto out_destroy_fork;
> > +		goto out_unlock;
>
> We have nothing locked here.
>
> 	if (error)
> 		return error;
>
> >  	lock_mode = XFS_ILOCK_EXCL;
> >  	xfs_ilock(dp, lock_mode);
> >  
> > -	if (!xfs_inode_has_attr_fork(dp))
> > +	if (!xfs_inode_has_attr_fork(dp)) {
> > +		xfs_ifork_zap_attr(dp);
> >  		goto out_cancel;
> > +	}
> >  
> >  	/*
> >  	 * No need to make quota reservations here. We expect to release some
> > @@ -383,9 +387,7 @@ xfs_attr_inactive(
> >  
> >  out_cancel:
> >  	xfs_trans_cancel(trans);
> > -out_destroy_fork:
> > -	/* kill the in-core attr fork before we drop the inode lock */
> > -	xfs_ifork_zap_attr(dp);
> > +out_unlock:
> >  	if (lock_mode)
> >  		xfs_iunlock(dp, lock_mode);
> >  	return error;
>
> Hmmmm. No, I don't think this is right - the existing code is doing
> the right thing.
>
> We just got an -EFSCORRUPTED from xfs_itruncate_extents(), and we
> are in the process of freeing the inode. The inode is unlinked,
> unreferenced and corrupt, so we need to leak the blocks we cannot
> access in the attr fork to be able to successfully free the inode
> and continue on without shutting down the filesystem.
>
> In this situation, we need to zap the in-memory attr fork so that
> the memory associated with the extents we just decided to leak is
> freed correctly and not leaked itself. Essentially, any error in
> this path should result in zapping the attr fork here.
>
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 351849fc18ff..4afa7fec4a2f 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -48,6 +48,7 @@ static int xfs_icwalk(struct xfs_mount *mp,
> >  		enum xfs_icwalk_goal goal, struct xfs_icwalk *icw);
> >  static int xfs_icwalk_ag(struct xfs_perag *pag,
> >  		enum xfs_icwalk_goal goal, struct xfs_icwalk *icw);
> > +static void xfs_inodegc_queue(struct xfs_inode *ip);
> >  
> >  /*
> >   * Private inode cache walk flags for struct xfs_icwalk. Must not
> > @@ -1843,7 +1844,10 @@ xfs_inodegc_inactivate(
> >  {
> >  	trace_xfs_inode_inactivating(ip);
> >  	xfs_inactive(ip);
> > -	xfs_inodegc_set_reclaimable(ip);
> > +	if (VFS_I(ip)->i_mode != 0)
> > +		xfs_inodegc_queue(ip);
> > +	else
> > +		xfs_inodegc_set_reclaimable(ip);
> >  }
>
> I don't think this works the way you think it does.
>
> Firstly, xfs_inodegc_queue() may try to flush the work queue
> (i.e. wait for pending work to be completed), so queuing inodegc
> work from an inodegc workqueue worker could deadlock the inodegc
> workqueue.
>
> Secondly, inodes with i_mode != 0 that aren't being unlinked are
> also pushed through this path to remove blocks beyond EOF, etc. This
> change means those inodes get permanently stuck on the inodegc queue
> as they always get re-added to the queue and never marked as
> reclaimable. See xfs_inode_mark_reclaimable() and
> xfs_inode_needs_inactive().
>
> Thirdly, pushing an inode that failed inactivation due to corruption
> through xfs_inactive() again will just result in the same corruption
> error occurring again. Hence they'll also get stuck forever in the
> inodegc loop.
>
> Ah.
>
> Yeah.
>
> Except for your test case.
>
> Error injection is random/transient, and you gave it a 33% chance of
> injecting an error. Hence it will continue looping through
> inactivation until it doesn't trigger the xfs_bmapi_read() error
> injection. Because you also changed it not to trash the attr fork on
> truncation error, the extent list remains present until every extent
> gets freed successfully. Hence ip->i_nblocks gets reduced to zero
> before we try to free the inode, and the specific failure your error
> injection test tripped over goes away.
>
> Now I understand - this fixes a transient corruption error
> caused by error injection, but in doing so will break production
> systems that encounter persistent corruption in inode inactivation.
>
> ---
>
> Yes, I agree the way xfs_inactive() and inodegc handle errors and
> cleanup needs improvement, but we've known this for a while now. But
> this doesn't change the fact that we currently need to be able to
> leak resources we can't access so we can continue to operate. It's
> fine for ASSERTs to fire on debug kernels in these situations - as
> developers we need to understand when these situations occur - but
> that doesn't mean the behaviour they are warning about needs to be
> fixed. It's just telling us that we are leaking stuff, it just
> doesn't know why.

...and should probably be logging the fact that the bad inode was
dropped on the floor and that the sysadmin should go run a fsck tool
of some kind to fix the problems.
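
Something like this, maybe -- a rough sketch only, not anything
actually posted, borrowing the xfs_inode_mark_sick() /
XFS_SICK_INO_BMBTA names from the existing health tracking code and
hanging it off the spot in xfs_inactive() where the attr fork error
currently gets dropped:

```
	/*
	 * Sketch: if attr fork inactivation fails we deliberately
	 * leak the blocks we couldn't reach, so record the corruption
	 * and tell the admin instead of silently eating the error.
	 */
	if (xfs_inode_has_attr_fork(ip)) {
		error = xfs_attr_inactive(ip);
		if (error) {
			xfs_inode_mark_sick(ip, XFS_SICK_INO_BMBTA);
			xfs_alert(ip->i_mount,
 "inode 0x%llx: attr fork inactivation failed, leaking blocks; run xfs_repair",
					(unsigned long long)ip->i_ino);
		}
	}
```

With the sickness state recorded, the debug assert in the freeing
path could then become something like
ASSERT(ip->i_nblocks == 0 || !xfs_inode_is_healthy(ip)), so it keeps
catching real leaks without firing on inodes we already know are
corrupt.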
> We have been waiting on having fine-grained health information for
> inodes to be able to handle situations like this more gracefully.
> That code is being merged in 6.4, and it means that we know the

It is? I didn't send you a pull request for

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

for 6.4.

At some point I want to talk to you about the rest of online fsck,
but I'm taking a breather for the last week or two until LSFMM.

--D

> inode is corrupt. Hence we can tailor the debug assert to only fire
> if the inode has not encountered corruption, do smarter things in
> inodegc/reclaim when dealing with inodes that are corrupt, etc.
>
> However, that doesn't mean the inactivation behaviour that triggered
> the assert is wrong. That assert often picks up code bugs, but a
> corruption error in freeing an inode is one of the situations where
> the "leak resources we can't access to free the inode" behaviour is
> intentional. That's what we want production machines to do, whilst
> on debug machines we want to understand why that situation happened
> and determine if there's something better we can do with it.
>
> Hence I'd suggest looking at the inode health code in the current
> linux-xfs/for-next tree, checking that the error injection you are
> using marks the inode as unhealthy, and working from there...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx