Re: [BUG REPORT] Deadlock when executing xfs/168 on XFS' previous for-next

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 5 Mar 2024 17:32:45 +1100

On Tue, Mar 05, 2024 at 10:18:45AM +0530, Chandan Babu R wrote:
> Hi,
> 
> I noticed the following call trace when executing xfs/168 on XFS' for-next
> as on 3rd March. The top most commit was,
> 
>     commit 601f8bc2440a25a386b1283ce15330c9ea3aaa07
>     Merge: 8d4dd9d741c3 27c86d43bcdb
>     Author: Chandan Babu R <chandanbabu@xxxxxxxxxx>
>     Date:   Thu Feb 29 10:01:16 2024 +0530
>     
>         Merge tag 'xfs-6.8-fixes-4' into xfs-for-next
>     
>         Changes for 6.8-rc7:
>     
>           * Drop experimental warning for FSDAX.
>     
>         Signed-off-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
> 
> The fstests configuration file used was,
> 
>     TEST_DEV=/dev/loop16
>     SCRATCH_DEV_POOL="/dev/loop5 /dev/loop6 /dev/loop7 /dev/loop8 /dev/loop9 /dev/loop10 /dev/loop11 /dev/loop12"
>     MKFS_OPTIONS='-f -m reflink=1,rmapbt=1, -i sparse=1, -d su=128k,sw=4'
>     MOUNT_OPTIONS='-o usrquota,grpquota,prjquota'
>     TEST_FS_MOUNT_OPTS="$TEST_FS_MOUNT_OPTS -o usrquota,grpquota,prjquota"
>     USE_EXTERNAL=no
>     LOGWRITES_DEV=/dev/loop15
> 
> Three tasks (i.e. 230900, 230901 and 230919) are blocked on
> trying to lock AG 1's AGI and one task (i.e. 230902) is blocked on trying to
> lock AG 1's AGF.
> 
> I have not been able to recreate the problem though.

Before I even look at it, I very much doubt this is a recent
regression.

> [18460.730782] XFS (loop5): Mounting V5 Filesystem 68e1b71a-583f-41c8-8194-bb82c4e2fe0d
> [18460.742765] XFS (loop5): Ending clean mount
> [18461.786504] XFS (loop5): EXPERIMENTAL online shrink feature in use. Use at your own risk!
> [18679.145200] INFO: task fsstress:230900 blocked for more than 122 seconds.
> [18679.147164]       Tainted: G        W          6.8.0-rc6+ #1
> [18679.148710] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [18679.150423] task:fsstress        state:D stack:0     pid:230900 tgid:230900 ppid:230899 flags:0x00000002
> [18679.152186] Call Trace:
> [18679.153257]  <TASK>

Ugh - those stack traces are unreadable because of all the unwinder
false positives (all the ? entries) in them. Can you either change
the unwinder config in your builds to get rid of them or
post-process the stack dumps so they are human readable?

/me selects everything and runs `:'<,'>!grep -v "?"` make it
somewhat more readable and then trims manually:

> [18679.174407]  down+0x71/0xa0
> [18679.175442]  xfs_buf_lock+0xa4/0x290 [xfs]
> [18679.176816]  xfs_buf_find_lock+0xf7/0x4d0 [xfs]
> [18679.179246]  xfs_buf_lookup.constprop.0+0x289/0x500 [xfs]
> [18679.183248]  xfs_buf_get_map+0x2b3/0xe40 [xfs]
> [18679.198661]  xfs_buf_read_map+0xbb/0x900 [xfs]
> [18679.212987]  xfs_trans_read_buf_map+0x449/0xb10 [xfs]
> [18679.219754]  xfs_read_agi+0x1cd/0x500 [xfs]
> [18679.224471]  xfs_ialloc_read_agi+0xc2/0x5b0 [xfs]
> [18679.227863]  xfs_dialloc+0x60e/0xbc0 [xfs]
> [18679.232961]  xfs_symlink+0x855/0xdc0 [xfs]
> [18679.238776]  xfs_vn_symlink+0x15d/0x3d0 [xfs]
> [18679.242688]  vfs_symlink+0x35a/0x5a0

Ok, symlink creation reading the AGI for inode allocation. Normal.

> [18679.282122]  down+0x71/0xa0
> [18679.282962]  xfs_buf_lock+0xa4/0x290 [xfs]
> [18679.284134]  xfs_buf_find_lock+0xf7/0x4d0 [xfs]
> [18679.285249]  xfs_buf_lookup.constprop.0+0x289/0x500 [xfs]
> [18679.287895]  xfs_buf_get_map+0x2b3/0xe40 [xfs]
> [18679.301750]  xfs_buf_read_map+0xbb/0x900 [xfs]
> [18679.315706]  xfs_trans_read_buf_map+0x449/0xb10 [xfs]
> [18679.321739]  xfs_read_agi+0x1cd/0x500 [xfs]
> [18679.326984]  xfs_ialloc_read_agi+0xc2/0x5b0 [xfs]
> [18679.330566]  xfs_dialloc+0x60e/0xbc0 [xfs]
> [18679.335605]  xfs_create+0x5d9/0xf60 [xfs]
> [18679.343710]  xfs_generic_create+0x46f/0x600 [xfs]
> [18679.348897]  vfs_mknod+0x45a/0x740
> [18679.349783]  do_mknodat+0x27b/0x4d0

special file creation reading the AGI for inode allocation. Normal.

> [18679.390879]  down+0x71/0xa0
> [18679.391730]  xfs_buf_lock+0xa4/0x290 [xfs]
> [18679.392845]  xfs_buf_find_lock+0xf7/0x4d0 [xfs]
> [18679.394012]  xfs_buf_lookup.constprop.0+0x289/0x500 [xfs]
> [18679.398640]  xfs_buf_get_map+0x2b3/0xe40 [xfs]
> [18679.414394]  xfs_buf_read_map+0xbb/0x900 [xfs]
> [18679.421339]  xfs_trans_read_buf_map+0x449/0xb10 [xfs]
> [18679.426929]  xfs_read_agf+0x1be/0x4d0 [xfs]
> [18679.431980]  xfs_alloc_read_agf+0xdf/0xf90 [xfs]
> [18679.441008]  xfs_alloc_fix_freelist+0x7f2/0x970 [xfs]
> [18679.456394]  xfs_alloc_vextent_prepare_ag+0x75/0x410 [xfs]
> [18679.457598]  xfs_alloc_vextent_exact_bno+0x2b8/0x440 [xfs]
> [18679.462437]  xfs_ialloc_ag_alloc+0x513/0x18d0 [xfs]
> [18679.469776]  xfs_dialloc+0x683/0xbc0 [xfs]
> [18679.473892]  xfs_create+0x5d9/0xf60 [xfs]
> [18679.481977]  xfs_generic_create+0x46f/0x600 [xfs]
> [18679.487565]  vfs_mknod+0x45a/0x740

Special file creation holding the AGI lock doing inode chunk
allocation needing to trying to get the AGF. Normal.

> [18679.536829]  down+0x71/0xa0
> [18679.537657]  xfs_buf_lock+0xa4/0x290 [xfs]
> [18679.538731]  xfs_buf_find_lock+0xf7/0x4d0 [xfs]
> [18679.539920]  xfs_buf_lookup.constprop.0+0x289/0x500 [xfs]
> [18679.542628]  xfs_buf_get_map+0x2b3/0xe40 [xfs]
> [18679.547076]  xfs_buf_read_map+0xbb/0x900 [xfs]
> [18679.562616]  xfs_trans_read_buf_map+0x449/0xb10 [xfs]
> [18679.569778]  xfs_read_agi+0x1cd/0x500 [xfs]
> [18679.573126]  xfs_ialloc_read_agi+0xc2/0x5b0 [xfs]
> [18679.578708]  xfs_finobt_calc_reserves+0xe7/0x4d0 [xfs]
> [18679.582480]  xfs_ag_resv_init+0x2c5/0x490 [xfs]
> [18679.586023]  xfs_ag_shrink_space+0x736/0xd30 [xfs]
> [18679.590730]  xfs_growfs_data_private.isra.0+0x55e/0x990 [xfs]
> [18679.599764]  xfs_growfs_data+0x2f1/0x410 [xfs]
> [18679.602212]  xfs_file_ioctl+0xd1e/0x1370 [xfs]

Growfs - that was not what I expected -  trying to run a
shrink operation and needing to read the AGI to calculate how
much space it needs to reserve. Huh, it does <read agi>, <read agf>
before anything so it shouldn't have a lock inversion.

Oh, there's a second call to xfs_ag_resv_init in an error handling
path. Yup, there's the bug.

	<read AGI>
	<read AGF>
	<do checks>
	<free ag resv>

	/* internal log shouldn't also show up in the free space btrees */
        error = xfs_alloc_vextent_exact_bno(&args,
                        XFS_AGB_TO_FSB(mp, pag->pag_agno, aglen - delta));
        if (!error && args.agbno == NULLAGBLOCK)
                error = -ENOSPC;

        if (error) {
                /*
                 * if extent allocation fails, need to roll the transaction to
                 * ensure that the AGFL fixup has been committed anyway.
                 */
>>>>>>          xfs_trans_bhold(*tpp, agfbp);
>>>>>>          err2 = xfs_trans_roll(tpp);
                if (err2)
                        return err2;
>>>>>>          xfs_trans_bjoin(*tpp, agfbp);
>>>>>>          goto resv_init_out;
        }
....
resv_init_out:
>>>>>>  err2 = xfs_ag_resv_init(pag, *tpp);	

Ok, so on error, it holds the AGF, commits the transaction, rejoins
the locked AGF to the new transaction, and then attempts to read
the AGI to restore the AG reservations.

Yup, that's a lock order inversion - it needs to hold both the AGF
and AGI across the transaction roll to avoid needing to re-read the
AGI in the error handling path.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx