On 2022-01-03 10:04:29 [-0500], Waiman Long wrote: > On 1/3/22 09:44, Sebastian Andrzej Siewior wrote: > > Is there something you recommend as a benchmark where I could get some > > numbers? > > In the case of PREEMPT_DYNAMIC, it depends on the default setting which is > used by most users. I will support disabling the optimization if > defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by > CONFIG_)PREEMPTION alone. > > As for microbenchmark, something that makes a lot of calls to malloc() or > related allocations can be used. Numbers I made: Sandy Bridge Haswell Skylake AMD-A8 7100 Zen2 ARM64 PREEMPT 5,123,896,822 5,215,055,226 5,077,611,590 6,012,287,874 6,234,674,489 20,000,000,100 IRQ 7,494,119,638 6,810,367,629 10,620,130,377 4,178,546,086 4,898,076,012 13,538,461,925 For micro benchmarking I did 1.000.000.000 iterations of preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore() [IRQ]. On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ loop took 7,494,119,638ns. The absolute numbers are not important, it is worth noting that preemption off/on is less expensive than IRQ off/on. Except for AMD and ARM64 where IRQ off/on was less expensive. The whole loop was performed with disabled interrupts so I don't expect much interference - but then I don't know much the µArch optimized away on local_irq_restore() given that the interrupts were already disabled. I don't have any recent Intel HW (I think) so I don't know if this is an Intel only thing (IRQ off/on cheaper than preemption off/on) but I guess that the recent uArch would behave similar to AMD. Moving on: For the test I run 100,000,000 iterations of kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT)); The BH suffix means that in_task() reported false during the allocation, otherwise it reported true. SD is the standard deviation. SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT. OPT means the optimisation (in_task() + task_obj) is active, NO-OPT means no optimisation (irq_obj is always used). The numbers are the time in ns needed for 100,000,000 iteration (alloc + free). I run 5 tests and used the median value here. If the standard deviation exceeded 10^9 then I repeated the test. The values remained in the same range since usually only one value was off and the other remained in the same range. Sandy Bridge SERVER OPT SERVER NO-OPT PREEMPT OPT PREEMPT NO-OPT ALLOC/FREE 8,519,295,176 9,051,200,652 10,627,431,395 11,198,189,843 SD 5,309,768 29,253,976 129,102,317 40,681,909 ALLOC/FREE BH 9,996,704,330 8,927,026,031 11,680,149,900 11,139,356,465 SD 38,237,534 72,913,120 23,626,932 116,413,331 The optimisation is visible in the SERVER-OPT case (~1.5s difference in the runtime (or 14.7ns per iteration)). There is hardly any difference between BH and !BH in the SERVER-NO-OPT case. For the SERVER case, the optimisation improves ~0.5s in runtime for the !BH case. For the PREEMPT case it also looks like ~0.5s improvement in the BH case while in the BH case it looks the other way around. DYN-SRV-OPT DYN-SRV-NO-OPT DYN-FULL-OPT DYN-FULL-NO-OPT ALLOC/FREE 11,069,180,584 10,773,407,543 10,963,581,285 10,826,207,969 SD 23,195,912 112,763,104 13,145,589 33,543,625 ALLOC/FREE BH 11,443,342,069 10,720,094,700 11,064,914,727 10,955,883,521 SD 81,150,074 171,299,554 58,603,778 84,131,143 DYN is CONFIG_PREEMPT_DYNAMIC enabled and CONFIG_PREEMPT_NONE is default. I don't see any difference vs CONFIG_PREEMPT except the default preemption state (so I didn't test that). The preemption counter is always forced-in so preempt_enable()/disable() is not optimized away. SRV is the default value (PREEMPT_NONE) and FULL is the overriden (preempt=full) state. Based on that, I don't see any added value by the optimisation once PREEMPT_DYNAMIC is enabled. ---- Zen2: SERVER OPT SERVER NO-OPT PREEMPT OPT PREEMPT NO-OPT ALLOC/FREE 8,126,735,313 8,751,307,383 9,822,927,142 10,045,105,425 SD 100,806,471 87,234,047 55,170,179 25,832,386 ALLOC/FREE BH 9,197,455,885 8,394,337,053 10,671,227,095 9,904,954,934 SD 155,223,919 57,800,997 47,529,496 105,260,566 On Zen2, the IRQ off/on was less expensive than preempt-off/on. So it looks like I mixed up the numbers in for PREEMPT OPT and NO-OPT but I re-run it twice and nothing significant changed… However the difference on PREEMPT for the !BH case is not as significant as on Sandy Bridge (~200ms here vs ~500ms there). DYN-SRV-OPT DYN-SRV-NO-OPT DYN-FULL-OPT DYN-FULL-NO-OPT ALLOC/FREE 9,680,498,929 10,180,973,847 9,644,453,405 10,224,416,854 SD 73,944,156 61,850,527 13,277,203 107,145,212 ALLOC/FREE BH 10,680,074,634 9,956,695,323 10,704,442,515 9,942,155,910 SD 75,535,172 34,524,493 54,625,678 87,163,920 For double testing and checking, the full git tree is available at [0] and the script to parse the results is at [1]. [0] git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging memcg [1] https://breakpoint.cc/parse-memcg.py > Cheers, > Longman Sebastian