On 2023/11/22 21:19, Vlastimil Babka wrote:
> On 11/22/23 12:54, Chengming Zhou wrote:
>> On 2023/11/22 19:40, Vlastimil Babka wrote:
>>> On 11/22/23 12:35, Chengming Zhou wrote:
>>>> On 2023/11/22 17:37, Vlastimil Babka wrote:
>>>>> On 11/20/23 19:49, Mark Brown wrote:
>>>>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@xxxxxxxxx wrote:
>>>>>>> From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>>>>>>>
>>>>>>> Now we will freeze slabs when moving them out of node partial list to
>>>>>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>>>>>
>>>>>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>>>>>> 2. get_freelist() when pick used in ___slab_alloc()
>>>>>>
>>>>>> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
>>>>>> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
>>>>>> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
>>>>>> introduced the issue. I've included the full bisect log below.
>>>>>>
>>>>>> When we see problems we see RCU stalls while logging in, for example:
>>>>>
>>>>> Can you try this, please?
>>>>>
>>>>
>>>> Great! I manually disabled __CMPXCHG_DOUBLE to reproduce the problem,
>>>> and this patch can solve the machine hang problem.
>>>>
>>>> BTW, I also did the performance testcase on the machine with 128 CPUs.
>>>>
>>>> stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>>>
>>>> base     patched
>>>> 2.22s    2.35s
>>>> 2.21s    3.14s
>>>> 2.19s    4.75s
>>>>
>>>> Found this atomic version performance numbers are not stable.
>>>
>>> That's weirdly too bad. Is that measured also with __CMPXCHG_DOUBLE
>>> disabled, or just the patch? The PG_workingset flag change should be
>>
>> The performance test is just the patch.
>>
>>> uncontended as we are doing it under list_lock, and with __CMPXCHG_DOUBLE
>>> there should be no interfering PG_locked interference.
>>>
>>
>> Yes, I don't know. Maybe it's related with my kernel config, making the
>> atomic operation much expensive? Will look again..
>
> I doubt it can explain going from 2.19s to 4.75s, must have been some
> interference on the machine?
>

Yes, it looks like it. There are some background services running on the
128-CPU machine.

Although "stress-ng --rawpkt 128 --rawpkt-ops 100000000" shows such a large
regression, I tried two other less contended testcases:

1. stress-ng --rawpkt 64 --rawpkt-ops 100000000
2. perf bench sched messaging -g 5 -t -l 100000

The performance numbers of this atomic version are pretty much the same,
so this atomic version should be good in most cases IMHO.

>> And I also tested the atomic-optional version like below, found the
>> performance numbers are much stable.
>
> This gets rather ugly and fragile so I'd maybe rather go back to the
> __unused field approach :/
>

Agree. If we don't want this atomic version, the __unused field approach
seems better.

Thanks!