From: Ye Bin <yebin10@xxxxxxxxxx>

This patch set solves a race between '__percpu_counter_compare()' and
CPU offline.

Before commit 5825bea05265 ("xfs: __percpu_counter_compare() inode count
debug too expensive"), I hit the following issue when running a CPU
online/offline test:

smpboot: CPU 1 is now offline
XFS: Assertion failed: percpu_counter_compare(&mp->m_ifree, 0) >= 0, file: fs/xfs/xfs_trans.c, line: 622
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 25512 Comm: fsstress Not tainted 5.10.0-04288-gcb31bdc8c65d #8
RIP: 0010:assfail+0x77/0x8b fs/xfs/xfs_message.c:110
RSP: 0018:ffff88810a5df5c0 EFLAGS: 00010293
RAX: ffff88810f3a8000 RBX: 0000000000000201 RCX: ffffffffaa8bd7c0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: ffff88810f3a8000 R09: ffffed103edf71cd
R10: ffff8881f6fb8e67 R11: ffffed103edf71cc R12: ffffffffab0108c0
R13: ffffffffab010220 R14: ffffffffffffffff R15: 0000000000000000
FS:  00007f8536e16b80(0000) GS:ffff8881f6f80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005617e1115f44 CR3: 000000015873a005 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 xfs_trans_unreserve_and_mod_sb+0x833/0xca0 fs/xfs/xfs_trans.c:622
 xlog_cil_commit+0x1169/0x29b0 fs/xfs/xfs_log_cil.c:1325
 __xfs_trans_commit+0x2c0/0xe20 fs/xfs/xfs_trans.c:889
 xfs_create_tmpfile+0x6a6/0x9a0 fs/xfs/xfs_inode.c:1320
 xfs_rename_alloc_whiteout fs/xfs/xfs_inode.c:3193 [inline]
 xfs_rename+0x58a/0x1e00 fs/xfs/xfs_inode.c:3245
 xfs_vn_rename+0x28e/0x410 fs/xfs/xfs_iops.c:436
 vfs_rename+0x10b5/0x1dd0 fs/namei.c:4329
 do_renameat2+0xa19/0xb10 fs/namei.c:4474
 __do_sys_renameat2 fs/namei.c:4512 [inline]
 __se_sys_renameat2 fs/namei.c:4509 [inline]
 __x64_sys_renameat2+0xe4/0x120 fs/namei.c:4509
 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x61/0xc6
RIP: 0033:0x7f853623d91d

I can reproduce the above issue by injecting latency into the kernel to
defeat the quick check in '__percpu_counter_compare()'.

When the quick check runs concurrently with CPU offline, the number of
online CPUs may already have decreased before percpu_counter_cpu_dead()
is called, so the dying CPU's per-CPU value has not yet been folded into
fbc->count. That leads to a wrong comparison result. For example:

Assumptions:
(1) batch = 32
(2) The final count is 2
(3) The number of CPUs is 4

Suppose the per-CPU values are as follows when CPU3 goes offline:

cpu0    cpu1    cpu2    cpu3
 31      31      31      31
fbc->count = -122 -> 'percpu_counter_cpu_dead()' hasn't been called yet.

At this point, check whether the percpu counter is greater than 0
(rhs = 0):

abs(count - rhs) = abs(-122 - 0) = 122
batch * num_online_cpus() = 32 * 3 = 96 -> the number of online CPUs has become 3

That is, the condition abs(count - rhs) > batch * num_online_cpus() is
met, so the quick check decides on fbc->count alone. The actual value is
2, but because count < 0 the function returns -1, which is the opposite
of the correct result.

Ye Bin (2):
  cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  lib/percpu_counter: fix dying cpu compare race

 include/linux/cpumask.h | 20 ++++++++++++++++----
 kernel/cpu.c            |  2 ++
 lib/percpu_counter.c    | 11 ++++++++++-
 3 files changed, 28 insertions(+), 5 deletions(-)

-- 
2.31.1