On Tue, Apr 11, 2023 at 12:06:24 PM +1000, Dave Chinner wrote:
> On Thu, Mar 30, 2023 at 01:46:10PM -0700, Wengang Wang wrote:
>> There is a deadlock with the following calltrace on process 10133:
>>
>> PID 10133 not sceduled for 4403385ms (was on CPU[10])
>> #0 context_switch() kernel/sched/core.c:3881
>> #1 __schedule() kernel/sched/core.c:5111
>> #2 schedule() kernel/sched/core.c:5186
>> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
>> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
>> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
>> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
>> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
>> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
>> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
>> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
>> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
>> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
>> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
>> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
>> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
>> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
>> #17 mount_bdev() fs/super.c:1417
>> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
>> #19 legacy_get_tree() fs/fs_context.c:647
>> #20 vfs_get_tree() fs/super.c:1547
>> #21 do_new_mount() fs/namespace.c:2843
>> #22 do_mount() fs/namespace.c:3163
>> #23 ksys_mount() fs/namespace.c:3372
>> #24 __do_sys_mount() fs/namespace.c:3386
>> #25 __se_sys_mount() fs/namespace.c:3383
>> #26 __x64_sys_mount() fs/namespace.c:3383
>> #27 do_syscall_64() arch/x86/entry/common.c:296
>> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
>>
>> It's waiting for xfs_perag.pagb_gen to increase (that is, for busy extent
>> clearing to happen). From the vmcore, it's waiting on AG 1. And the ONLY
>> busy extent for AG 1 is with the transaction (in xfs_trans.t_busy) for
>> process 10133.
>> That busy extent is created in a previous EFI with the same transaction.
>> Process 10133 is waiting, so it has no chance to commit that transaction.
>> So busy extent clearing can't happen and pagb_gen remains unchanged. So a
>> deadlock is formed.
>
> We've talked about this "busy extent in transaction" issue before:
>
> https://lore.kernel.org/linux-xfs/20210428065152.77280-1-chandanrlinux@xxxxxxxxx/
>
> and we were closing in on a practical solution before it went silent.
>
> I'm not sure if there's a different fix we can apply here - maybe
> free one extent per transaction instead of all the extents in an EFI
> in one transaction and relog the EFD at the end of each extent free
> transaction roll?
>

Consider the case of executing a truncate operation which involves freeing two
file extents on a filesystem which has the refcount feature enabled.

xfs_refcount_decrease_extent() will be invoked twice and hence
XFS_DEFER_OPS_TYPE_REFCOUNT will have two "struct xfs_refcount_intent"
associated with it.

Processing each of the "struct xfs_refcount_intent" can cause two refcount
btree blocks to be freed:
- A high level transaction will invoke xfs_refcountbt_free_block() twice.
- The first invocation adds an extent entry to the transaction's busy extent
  list. The second invocation can find the previously freed busy extent and
  hence wait indefinitely for the busy extent to be flushed.

Also, processing a single "struct xfs_refcount_intent" can require the leaf
block and its immediate parent block to be freed. The leaf block is added to
the transaction's busy list. Freeing the parent block can result in the task
waiting for the busy extent (present in the high level transaction) to be
flushed.

Hence, IMHO this approach is most likely not a feasible solution.

>> commit 06058bc40534530e617e5623775c53bb24f032cb disallowed using busy extents
>> for any path that calls xfs_extent_busy_trim(). That looks like overkill.
>> For AGFL block allocation, it just uses the first extent that satisfies the
>> request; it won't try another extent to choose a "better" one. So it's safe
>> to reuse busy extents for AGFL.
>
> AGFL block allocation is not "for immediate use". The blocks get
> placed on the AGFL for -later- use, and not necessarily even within
> the current transaction. Hence a freelist block is still considered
> free space, not as used space. The difference is that we assume AGFL
> blocks can always be used immediately and they aren't constrained by
> being busy or having pending discards.
>
> Also, we have to keep in mind that we can allocate data blocks from
> the AGFL in low space situations. Hence it is not safe to place busy
> or discard-pending blocks on the AGFL, as this can result in them
> being allocated for user data and overwritten before the checkpoint
> that marked them busy has been committed to the journal....
>
> As such, I don't think it is safe to ignore busy extent state
> just because we are filling the AGFL from the current free space
> tree.
>
> Cheers,
>
> Dave.

--
chandan
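
PS: To make the self-deadlock condition discussed above concrete, here is a
rough userspace sketch. It is not kernel code: struct busy_extent_sketch and
flush_wait_self_deadlocks() are hypothetical names made up for illustration,
and the model assumes that only a transaction other than the waiter could
commit and so advance the busy generation (pagb_gen) for the AG.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one busy extent record. */
struct busy_extent_sketch {
	unsigned int agno;	/* AG the busy extent belongs to */
	int owner;		/* id of the transaction holding it on its busy list */
};

/*
 * Return true if a task running transaction 'waiter' would wait forever
 * for a busy extent flush in AG 'agno': there is at least one busy extent
 * in that AG and every one of them is owned by the waiter itself, so no
 * other transaction can commit and clear them.
 */
static bool
flush_wait_self_deadlocks(const struct busy_extent_sketch *busy, size_t nr,
			  unsigned int agno, int waiter)
{
	bool waiter_owns_one = false;

	for (size_t i = 0; i < nr; i++) {
		if (busy[i].agno != agno)
			continue;
		/* Another owner could commit and clear its extent. */
		if (busy[i].owner != waiter)
			return false;
		waiter_owns_one = true;
	}
	return waiter_owns_one;
}
```

With a single busy extent in AG 1 owned by the waiting transaction, as in the
vmcore above, the check returns true. The same holds in this model for the
refcount btree case, where the busy extent was added by an earlier block free
in the same high level transaction.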