Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.

On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> > Is there something you recommend as a benchmark where I could get some
> > numbers?
> 
> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
> used by most users. I will support disabling the optimization if
> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
> CONFIG_PREEMPTION alone.
> 
> As for microbenchmark, something that makes a lot of calls to malloc() or
> related allocations can be used.

Numbers I made:

          Sandy Bridge    Haswell         Skylake         AMD A8-7100     Zen2            ARM64
PREEMPT   5,123,896,822   5,215,055,226   5,077,611,590   6,012,287,874   6,234,674,489  20,000,000,100
IRQ       7,494,119,638   6,810,367,629  10,620,130,377   4,178,546,086   4,898,076,012  13,538,461,925

For microbenchmarking I did 1,000,000,000 iterations of
preempt_disable()/preempt_enable() [PREEMPT] and
local_irq_save()/local_irq_restore() [IRQ].
On Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
loop took 7,494,119,638ns. The absolute numbers are not important; what
is worth noting is that preemption off/on is less expensive than IRQ
off/on - except on AMD and ARM64, where IRQ off/on was less expensive.
The whole loop was performed with interrupts disabled so I don't expect
much interference - but then I don't know how much the µArch optimized
away on local_irq_restore() given that interrupts were already
disabled.
I don't have any recent Intel HW (I think) so I don't know if this is
an Intel-only thing (preemption off/on cheaper than IRQ off/on) but I
would guess that a recent µArch behaves similarly to AMD.

Moving on: for the test I ran 100,000,000 iterations of
     kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
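
The timing around it was the obvious loop; a minimal sketch
(bench_alloc_free() is a made-up name, the allocation line is the one
above):

#include <linux/ktime.h>
#include <linux/slab.h>
#include <linux/types.h>

#define ITERS	100000000ULL

/* One ALLOC/FREE sample: time ITERS alloc+free pairs, in ns. */
static u64 bench_alloc_free(void)
{
	u64 i, start;

	start = ktime_get_ns();
	for (i = 0; i < ITERS; i++)
		kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));

	return ktime_get_ns() - start;
}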

The BH suffix means that in_task() reported false during the
allocation; otherwise it reported true.
SD is the standard deviation.
SERVER means CONFIG_PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
means no optimisation (irq_obj is always used).
The numbers are the time in ns needed for 100,000,000 iterations (alloc
+ free). I ran 5 tests and used the median value here. If the standard
deviation exceeded 10^9 then I repeated the test; usually only one
value was off and the others remained in the same range.
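
For reference, the optimisation picks the per-CPU obj_stock based on
in_task(), roughly like this (paraphrased from mm/memcontrol.c, not a
verbatim copy):

/* OPT: task context takes task_obj under preempt_disable() only;
 * BH/IRQ context - and every context with NO-OPT - takes irq_obj
 * under local_irq_save().
 */
static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
{
	struct memcg_stock_pcp *stock;

	if (likely(in_task())) {	/* the OPT fast path */
		*pflags = 0UL;
		preempt_disable();
		stock = this_cpu_ptr(&memcg_stock);
		return &stock->task_obj;
	}

	local_irq_save(*pflags);
	stock = this_cpu_ptr(&memcg_stock);
	return &stock->irq_obj;
}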

Sandy Bridge
                 SERVER OPT   SERVER NO-OPT    PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE    8,519,295,176   9,051,200,652    10,627,431,395  11,198,189,843
SD                5,309,768      29,253,976       129,102,317      40,681,909
ALLOC/FREE BH 9,996,704,330   8,927,026,031    11,680,149,900  11,139,356,465
SD               38,237,534      72,913,120        23,626,932     116,413,331

The optimisation is visible in the SERVER-OPT case: ~1.5s difference
in runtime between the !BH and BH runs (or ~14.7ns per iteration).
There is hardly any difference between BH and !BH in the SERVER-NO-OPT
case.
For the SERVER case, the optimisation improves the runtime by ~0.5s in
the !BH case.
For the PREEMPT case it also looks like a ~0.5s improvement in the !BH
case, while in the BH case it is the other way around.

                 DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT   DYN-FULL-NO-OPT
ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285    10,826,207,969
SD                 23,195,912     112,763,104      13,145,589        33,543,625
ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727    10,955,883,521
SD                 81,150,074     171,299,554      58,603,778        84,131,143

DYN means CONFIG_PREEMPT_DYNAMIC is enabled with CONFIG_PREEMPT_NONE
as the default. I don't see any difference vs CONFIG_PREEMPT except
the default preemption state (so I didn't test that). The preemption
counter is always forced in, so preempt_disable()/preempt_enable() are
not optimized away. SRV is the default state (PREEMPT_NONE) and FULL
is the overridden (preempt=full) state.

Based on that, I don't see any added value from the optimisation once
PREEMPT_DYNAMIC is enabled.

----
Zen2:
                 SERVER OPT   SERVER NO-OPT   PREEMPT OPT      PREEMPT NO-OPT
ALLOC/FREE    8,126,735,313   8,751,307,383    9,822,927,142   10,045,105,425
SD              100,806,471      87,234,047       55,170,179       25,832,386
ALLOC/FREE BH 9,197,455,885   8,394,337,053   10,671,227,095    9,904,954,934
SD              155,223,919      57,800,997       47,529,496      105,260,566

On Zen2, IRQ off/on was less expensive than preempt off/on, so it
looks as if I mixed up the numbers for PREEMPT OPT and NO-OPT, but I
re-ran the test twice and nothing significant changed… However, the
difference on PREEMPT for the !BH case is not as significant as on
Sandy Bridge (~200ms here vs ~500ms there).

                 DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT  DYN-FULL-NO-OPT
ALLOC/FREE      9,680,498,929  10,180,973,847   9,644,453,405  10,224,416,854
SD                 73,944,156      61,850,527      13,277,203     107,145,212
ALLOC/FREE BH  10,680,074,634   9,956,695,323  10,704,442,515   9,942,155,910
SD                 75,535,172      34,524,493      54,625,678      87,163,920

For double-checking and re-testing, the full git tree is available at
[0] and the script to parse the results is at [1].

[0] git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging memcg
[1] https://breakpoint.cc/parse-memcg.py

> Cheers,
> Longman

Sebastian



