On Tue, Jan 26, 2021 at 02:59:18PM +0100, Michal Hocko wrote:
> > > This thread shows that this is still somehow related to performance but
> > > the real reason is not clear. I believe we should be focusing on the
> > > actual reasons for the performance impact than playing with some fancy
> > > math and tuning for a benchmark on a particular machine which doesn't
> > > work for others due to subtle initialization timing issues.
> > >
> > > Fundamentally why should higher number of CPUs imply the size of slab in
> > > the first place?
> >
> > A 1st answer is that the activity and the number of threads involved
> > scales with the number of CPUs. Regarding the hackbench benchmark as
> > an example, the number of group/threads raise to a higher level on the
> > server than on the small system which doesn't seem unreasonable.
> >
> > On 8 CPUs, I run hackbench with up to 16 groups which means 16*40
> > threads. But I raise up to 256 groups, which means 256*40 threads, on
> > the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
> > regress on the 224 CPUs system. The next test with 4 groups starts
> > to regress by -7%. But the next one: hackbench -g 16 regresses by 187%
> > (duration is almost 3 times longer). It seems reasonable to assume
> > that the number of running threads and resources scale with the number
> > of CPUs because we want to run more stuff.
>
> OK, I do understand that more jobs scale with the number of CPUs but I
> would also expect that higher order pages are generally more expensive
> to get so this is not really a clear cut especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.

The flip side is that smaller orders increase zone lock contention and
contention can scale with the number of CPUs so it's partially related.
hackbench-sockets is an extreme case (pipetest is not affected) but it's
the messenger in this case.

On an x86-64 2-socket 40 core (80 threads) machine, comparing a revert
of the patch with vanilla 5.11-rc5 gives

hackbench-process-sockets
                            5.11-rc5                 5.11-rc5
                     revert-lockstat         vanilla-lockstat
Amean     1        1.1560 (   0.00%)        1.0633 *   8.02%*
Amean     4        2.0797 (   0.00%)        2.5470 * -22.47%*
Amean     7        3.2693 (   0.00%)        4.3433 * -32.85%*
Amean     12       5.2043 (   0.00%)        6.5600 * -26.05%*
Amean     21      10.5817 (   0.00%)       11.3320 *  -7.09%*
Amean     30      13.3923 (   0.00%)       15.5817 * -16.35%*
Amean     48      20.3893 (   0.00%)       23.6733 * -16.11%*
Amean     79      31.4210 (   0.00%)       38.2787 * -21.83%*
Amean     110     43.6177 (   0.00%)       53.8847 * -23.54%*
Amean     141     56.3840 (   0.00%)       68.4257 * -21.36%*
Amean     172     70.0577 (   0.00%)       85.0077 * -21.34%*
Amean     203     81.9717 (   0.00%)      100.7137 * -22.86%*
Amean     234     95.1900 (   0.00%)      116.0280 * -21.89%*
Amean     265    108.9097 (   0.00%)      130.4307 * -19.76%*
Amean     296    119.7470 (   0.00%)      142.3637 * -18.89%*

i.e. the patch incurs a 7% to 32% performance penalty. This bisected
cleanly yesterday when I was looking for the regression and then found
the thread.

Numerous caches change size. For example, kmalloc-512 goes from order-0
(vanilla) to order-2 with the revert.
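
As a back-of-the-envelope illustration of why the kmalloc-512 order change matters, the sketch below estimates how many objects fit in one slab and how many slab allocations a given number of objects needs. It assumes 4 KiB base pages and ignores per-object metadata and debug padding; the helper names are hypothetical and this is not how SLUB itself computes its order.

```python
# Rough sketch: objects per slab and slab allocations for a given page order.
# Assumptions (not from the thread): 4 KiB base pages, and the whole slab is
# usable for objects (no per-object metadata, red zones or debug padding).

PAGE_SIZE = 4096  # assumed x86-64 base page size in bytes


def objects_per_slab(object_size: int, order: int) -> int:
    """Return how many objects of object_size fit in one slab of 2**order pages."""
    slab_bytes = PAGE_SIZE << order
    return slab_bytes // object_size


def slabs_needed(num_objects: int, object_size: int, order: int) -> int:
    """Return how many slabs (page allocator calls) hold num_objects objects."""
    per_slab = objects_per_slab(object_size, order)
    return -(-num_objects // per_slab)  # ceiling division


if __name__ == "__main__":
    for order in (0, 2):
        print(
            f"order-{order}: {objects_per_slab(512, order)} objects/slab, "
            f"{slabs_needed(1_000_000, 512, order)} slabs for 1M objects"
        )
```

Under those assumptions an order-0 kmalloc-512 slab holds 8 objects and an order-2 slab holds 32, so populating the same number of objects takes about four times as many trips to the page allocator at order-0.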
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name      con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VANILLA
&zone->lock:        1202731        1203433           0.07         120.55     1555485.48           1.29        8920825       12537091           0.06          84.10     9855085.12           0.79
-----------
&zone->lock           61903  [<00000000b47dc96a>] free_one_page+0x3f/0x530
&zone->lock            7655  [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
&zone->lock           36529  [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock         1097346  [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370
-----------
&zone->lock           44716  [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
&zone->lock           69813  [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock           31596  [<00000000b47dc96a>] free_one_page+0x3f/0x530
&zone->lock         1057308  [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370

REVERT
&zone->lock:         735827         739037           0.06          66.12      699661.56           0.95        4095299        7757942           0.05          54.35     5670083.68           0.73
-----------
&zone->lock          101927  [<00000000a60d5f86>] free_one_page+0x3f/0x530
&zone->lock          626426  [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370
&zone->lock            9207  [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock            1477  [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
-----------
&zone->lock            6249  [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
&zone->lock           92224  [<00000000a60d5f86>] free_one_page+0x3f/0x530
&zone->lock           19690  [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock          620874  [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370

Each individual wait time is small but waittime-max is roughly double
(120us vanilla vs 66us reverting the patch). Total wait time is also
roughly doubled due to the patch. Acquisitions rise by roughly 60%
(7757942 to 12537091). So mostly this is down to the number of times
SLUB calls into the page allocator which only caches order-0 pages on a
per-cpu basis. I do have a prototype for a high-order per-cpu allocator
but it is very rough -- high watermarks stop making sense, code is
rough, memory needed for the pcpu structures quadruples etc.

-- 
Mel Gorman
SUSE Labs