[add lfsdevel to cc to spread the, um, love] TLDR: percpu_counter_sum doesn't add in the values from CPUs in the dying mask. As a result, the summation can race with cpu hotunplug and return the wrong values. On Wed, Mar 15, 2023 at 09:13:05AM +1100, Dave Chinner wrote: > On Tue, Mar 14, 2023 at 09:31:00AM -0700, Darrick J. Wong wrote: > > On Tue, Mar 14, 2023 at 05:06:49PM +0800, Ye Bin wrote: > > > From: Ye Bin <yebin10@xxxxxxxxxx> > > > > > > There's a issue when do cpu offline test: > > > CPU: 48 PID: 1168152 Comm: umount Kdump: loaded Tainted: G L > > > pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) > > > pc : assfail+0x8c/0xb4 > > > lr : assfail+0x38/0xb4 > > > sp : ffffa00033ce7c40 > > > x29: ffffa00033ce7c40 x28: ffffa00014794f30 > > > x27: ffffa00014f6ca20 x26: 1fffe0120b2e2030 > > > x25: ffff009059710188 x24: ffff00886c0a4650 > > > x23: 1fffe0110d8148ca x22: ffff009059710180 > > > x21: ffffa00015155680 x20: ffff00886c0a4000 > > > x19: 0000000000000001 x18: 0000000000000000 > > > x17: 0000000000000000 x16: 0000000000000000 > > > x15: 0000000000000007 x14: 1fffe00304cef265 > > > x13: ffff00182642b200 x12: ffff8012d37757bf > > > x11: 1fffe012d37757be x10: ffff8012d37757be > > > x9 : ffffa00010603a0c x8 : 0000000041b58ab3 > > > x7 : ffff94000679cf44 x6 : 00000000ffffffc0 > > > x5 : 0000000000000021 x4 : 00000000ffffffca > > > x3 : 1ffff40002a27ee1 x2 : 0000000000000004 > > > x1 : 0000000000000000 x0 : ffffa0001513f000 > > > Call trace: > > > assfail+0x8c/0xb4 > > > xfs_destroy_percpu_counters+0x98/0xa4 > > > xfs_fs_put_super+0x1a0/0x2a4 > > > generic_shutdown_super+0x104/0x2c0 > > > kill_block_super+0x8c/0xf4 > > > deactivate_locked_super+0xa4/0x164 > > > deactivate_super+0xb0/0xdc > > > cleanup_mnt+0x29c/0x3ec > > > __cleanup_mnt+0x1c/0x30 > > > task_work_run+0xe0/0x200 > > > do_notify_resume+0x244/0x320 > > > work_pending+0xc/0xa0 > > > > > > We analyzed the data in vmcore is correct. But triggered above issue. > > > As f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface") > > > commit describes there is a small race window between the online CPUs traversal > > > of percpu_counter_sum and the CPU offline callback. This means percpu_counter_sum() > > > may return incorrect result during cpu offline. > > > To solve above issue use percpu_counter_sum_all() interface to make sure > > > result is correct to prevent false triggering of assertions. > > > > How about the other percpu_counter_sum callsites inside XFS? Some of > > them are involved in writing ondisk metadata (xfs_log_sb) or doing > > correctness checks (fs/xfs/scrub/*); shouldn't those also be using the > > _all variant? > > Ugh. I kinda wish that the percpu_counter_sum_all() patch had been > cc'd to lists for subsystems that use percpu_counter_sum() > extensively, or just to people who have modified that code in the > past. > > The problem is that it uses cpu_possible_mask, which means it > walks all possible CPUs that can be added to the system even if the > CPUs aren't physically present. That can be a lot in the case of > systems that can have cpu-capable nodes hotplugged into them, and > that makes percpu_counter_sum_all() excitingly expensive for no good > reason. > > AFAICT, if we are trying to close a race condition between iterating > online CPUs not summing dying CPUs and the cpu dead notification > updating the sum, then shouldn't we be using > cpu_mask_or(cpu_online_mask, cpu_dying_mask) for summing iteration > rather than just cpu_online_mask? > > i.e. when a CPU is being taken down, it gets added to the > cpu_dying_mask, then removed from the cpu_online_mask, then the > offline notifications are run (i.e. the percpu counter dead > callback), and when the CPU reaches the CPUHP_TEARDOWN_CPU state, > it is removed from the cpu_dying_mask because it is now dead and all > the "cpu dying" callbacks have been run. > > Except when a hotplug event is being processed, cpu_dying_mask will > be empty, hence there is little change in summing overhead. But it > will close the summing race condition when a CPU is being > offlined... > > That, I think, is the solution we want for XFS. Having the percpu > counters just do the right thing is far better than always having to > wonder if summation interface we are using is correct in the face of > CPU hotplug. I'll put a patchset together to do: > > 1. fix percpu_counter_sum() to include the dying mask in it's > iteration. This should fix the XFS issue. I took a quick look at ext4 and btrfs usage of percpu_counter_sum. I /think/ they're less impacted because most of the usage seems to be for things like statfs which are inherently racy. That said, mixing in the dying mask sounds like a cheap fix. > 2. change the only user of percpu_counter_sum_all() to only use > percpu_counter_sum() because percpu_counter_sum_all() is now > redundant. > 3. remove percpu_counter_sum_all() because it is unused. Sounds reasonable to /me. :) --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx