On 1/5/22 15:22, Sebastian Andrzej Siewior wrote:
On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
Is there something you recommend as a benchmark where I could get some
numbers?
In the case of PREEMPT_DYNAMIC, it depends on the default setting, which
is what most users will run with. I will support disabling the
optimization if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT),
just not by CONFIG_PREEMPTION alone.
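In other words, a guard along these lines (just a sketch; the macro name
is made up for illustration):

/*
 * Sketch only: on preemptible kernels always take the irq_obj path and
 * skip the in_task()/task_obj fast path.
 */
#if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT)
#define USE_TASK_OBJ_STOCK()    false
#else
#define USE_TASK_OBJ_STOCK()    true
#endif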
As for a microbenchmark, something that makes a lot of calls to malloc()
or related allocation functions can be used.
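For instance, a simple loop like this (the allocation size and iteration
count are arbitrary), timed with time(1):

#include <stdlib.h>

int main(void)
{
        long i;

        for (i = 0; i < 100000000L; i++) {
                void *p = malloc(128);

                if (!p)
                        return 1;
                /* Touch the buffer so the allocation isn't optimized away. */
                *(volatile char *)p = 0;
                free(p);
        }
        return 0;
}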
The numbers I measured:
         Sandy Bridge   Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
PREEMPT  5,123,896,822  5,215,055,226  5,077,611,590   6,012,287,874  6,234,674,489  20,000,000,100
IRQ      7,494,119,638  6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925
Thanks for the extensive testing. I usually perform my performance tests
on Intel hardware. I didn't realize that Zen2 and arm64 perform better
with irq on/off.
For micro benchmarking I did 1,000,000,000 iterations of
preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
[IRQ].
On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
loop took 7,494,119,638ns. The absolute numbers are not important; what
is worth noting is that preemption off/on is less expensive than IRQ
off/on - except on AMD and ARM64, where IRQ off/on was less expensive.
The whole loop was performed with interrupts disabled so I don't expect
much interference - but then I don't know how much the µArch optimized
away in local_irq_restore() given that the interrupts were already
disabled.
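Schematically the loops were something like this (a simplified sketch,
not the exact code I used; the interrupt disabling around the whole run
is omitted here):

#include <linux/module.h>
#include <linux/ktime.h>

static int __init bench_init(void)
{
        unsigned long flags;
        ktime_t start;
        long i;

        start = ktime_get();
        for (i = 0; i < 1000000000L; i++) {
                preempt_disable();
                preempt_enable();
        }
        pr_info("PREEMPT: %lld ns\n", ktime_to_ns(ktime_sub(ktime_get(), start)));

        start = ktime_get();
        for (i = 0; i < 1000000000L; i++) {
                local_irq_save(flags);
                local_irq_restore(flags);
        }
        pr_info("IRQ: %lld ns\n", ktime_to_ns(ktime_sub(ktime_get(), start)));

        /* One-shot measurement: report the numbers and fail the load. */
        return -ENODEV;
}
module_init(bench_init);
MODULE_LICENSE("GPL");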
I don't have any recent Intel HW (I think) so I don't know if this is an
Intel-only thing (preemption off/on cheaper than IRQ off/on) but I guess
that a recent µArch would behave similar to AMD.
Moving on: For the test I ran 100,000,000 iterations of
kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
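That is, roughly (a simplified sketch of the timed loop, not the exact
code):

static s64 bench_alloc_free(long iterations)
{
        ktime_t start = ktime_get();
        long i;

        for (i = 0; i < iterations; i++)
                kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));

        return ktime_to_ns(ktime_sub(ktime_get(), start));
}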
The BH suffix means that in_task() reported false during the allocation,
otherwise it reported true.
SD is the standard deviation.
SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
means no optimisation (irq_obj is always used).
The numbers are the time in ns needed for 100,000,000 iterations (alloc +
free). I ran 5 tests and used the median value here. If the standard
deviation exceeded 10^9 then I repeated the test; the values remained in
the same range, since usually only one value was off and the others
stayed close together.
Sandy Bridge
               SERVER OPT     SERVER NO-OPT  PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE     8,519,295,176  9,051,200,652  10,627,431,395  11,198,189,843
SD             5,309,768      29,253,976     129,102,317     40,681,909
ALLOC/FREE BH  9,996,704,330  8,927,026,031  11,680,149,900  11,139,356,465
SD             38,237,534     72,913,120     23,626,932      116,413,331
My own testing, tracking the number of times in_task() is true or false,
indicated that most of the kmalloc() calls are done by tasks. Only a few
percent of the time is in_task() false. That is the reason why I
optimized the case where in_task() is true.
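Roughly, the selection behind that optimisation looks like this (a
simplified sketch of the task_obj/irq_obj split, not a verbatim copy of
mm/memcontrol.c):

static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
{
        struct memcg_stock_pcp *stock;

        if (likely(in_task())) {
                /* Task context: disabling preemption is sufficient. */
                *pflags = 0UL;
                preempt_disable();
                stock = this_cpu_ptr(&memcg_stock);
                return &stock->task_obj;
        }

        /* Interrupt/softirq context: disable interrupts and use irq_obj. */
        local_irq_save(*pflags);
        stock = this_cpu_ptr(&memcg_stock);
        return &stock->irq_obj;
}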
The optimisation is visible in the SERVER-OPT case (~1.5s difference in
runtime between the BH and !BH case, or 14.7ns per iteration). There is
hardly any difference between BH and !BH in the SERVER-NO-OPT case.
For the SERVER case, the optimisation improves the runtime by ~0.5s in
the !BH case.
For the PREEMPT case it also looks like a ~0.5s improvement in the !BH
case, while in the BH case it is the other way around.
               DYN-SRV-OPT     DYN-SRV-NO-OPT  DYN-FULL-OPT    DYN-FULL-NO-OPT
ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285  10,826,207,969
SD             23,195,912      112,763,104     13,145,589      33,543,625
ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727  10,955,883,521
SD             81,150,074      171,299,554     58,603,778      84,131,143
DYN means CONFIG_PREEMPT_DYNAMIC is enabled and CONFIG_PREEMPT_NONE is
the default. I don't see any difference vs CONFIG_PREEMPT except the
default preemption state (so I didn't test that). The preemption counter
is always forced in, so preempt_enable()/disable() is not optimized away.
SRV is the default state (PREEMPT_NONE) and FULL is the overridden
(preempt=full) state.
Based on that, I don't see any added value from the optimisation once
PREEMPT_DYNAMIC is enabled.
The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data
points, I am not going to object to this patch then. I will try to look
further into why this is the case when I have time.
Cheers,
Longman