On 1/29/25 03:54, Suren Baghdasaryan wrote:
> On Tue, Jan 28, 2025 at 3:43 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>>
>> On Tue, Jan 28, 2025 at 11:35 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> >
>> > On Mon, 27 Jan 2025 11:38:32 -0800
>> > Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>> >
>> > > On Sun, Jan 26, 2025 at 8:47 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>> > > >
>> > > > On 1/26/25 08:02, Suren Baghdasaryan wrote:
>> > > > > When a sizable code section is protected by a disabled static key, that
>> > > > > code gets into the instruction cache even though it's not executed and
>> > > > > consumes the cache, increasing cache misses. This can be remedied by
>> > > > > moving such code into a separate uninlined function. The improvement
>> > >
>> > > Sorry, I missed adding Steven Rostedt to the CC list since his
>> > > advice was instrumental in finding the way to optimize the static key
>> > > performance in this patch. Added now.
>> > >
>> > > >
>> > > > Weird, I thought static_branch_likely/unlikely/maybe was already
>> > > > handling this by the unlikely case being a jump to a block away from the
>> > > > fast-path stream of instructions, thus making it less likely to get cached.
>> > > > AFAIU even plain likely()/unlikely() should do this, along with branch
>> > > > prediction hints.
>> > >
>> > > This was indeed an unexpected overhead when I measured it on Android.
>> > > Cache pollution was my understanding of the cause of this high
>> > > overhead after Steven told me to try uninlining the protected code. He
>> > > has done something similar in the tracing subsystem. But maybe I
>> > > misunderstood the real reason. Steven, could you please verify whether
>> > > my understanding of the cause of the high overhead is correct here?
>> > > Maybe there is something else at play that I missed?
>> >
>> > From what I understand, the compiler will only move code to the end
>> > of a function with unlikely(). But the code after the function could
>> > also be in the control flow path. If you have several functions that are
>> > called together, adding code to the unlikely() cases may not help the
>> > speed.
>> >
>> > I made an effort to make the tracepoint code call functions instead of
>> > having everything inlined. It actually brought down the size of the text of
>> > the kernel, but looking in the change logs I never posted benchmarks. But
>> > I'm sure making the scheduler's text section smaller probably did
>> > help.
>> >
>> > > > That would be in line with my understanding above. Does the arm64 compiler
>> > > > not do it as well as x86 (which could maybe be found out by disassembling),
>> > > > or does the Pixel 6 CPU somehow cache these out-of-line blocks more
>> > > > aggressively, so that only a function call stops it?
>> > >
>> > > I'll disassemble the code and see what it looks like.
>> >
>> > I think I asked you to do that too ;-)
>>
>> Yes you did! And I disassembled almost every one of these functions during
>> my investigation, but in my infinite wisdom I did not save any of them.
>> So now I need to do that again to answer Vlastimil's question. I'll
>> try to do that today.
>
> Yeah, quite a difference. This is alloc_tagging_slab_alloc_hook() with
> the outlined version of __alloc_tagging_slab_alloc_hook():

Not fluent in arm64 assembly, but let's see...

> ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
> ffffffc0803a2dd8: d503201f nop
> ffffffc0803a2ddc: d65f03c0 ret

So that's an immediate return unless the static key rewrites the nop.
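For readers following along, the shape being discussed is roughly this (a
simplified sketch with made-up names, not the actual alloc tagging code):
the sizable body is moved into its own uninlined function, and the static
key test compiles down to a patchable nop in the caller.

	#include <linux/jump_label.h>

	/* Illustrative names only -- not the actual alloc tagging code. */
	DEFINE_STATIC_KEY_FALSE(my_feature_key);

	/* The sizable body lives in its own uninlined function, so its
	 * instructions stay out of the callers' icache when the key is off. */
	static noinline void my_feature_slowpath(void *obj)
	{
		/* ... the expensive accounting work ... */
	}

	static inline void my_feature_hook(void *obj)
	{
		/* Compiles to a single patchable nop while the key is disabled;
		 * it is patched into a branch when the key is enabled at runtime. */
		if (static_branch_unlikely(&my_feature_key))
			my_feature_slowpath(obj);
	}

With the key disabled, that is exactly the nop (plus the wrapper's ret) we
see above.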
BTW, I wouldn't expect alloc_tagging_slab_alloc_hook() to exist as a
separate function in the first place, since it's "static inline". It seems
weird to do a function call just to perform a static key test. We should
perhaps force-inline it (see the sketch at the end of this mail).

> ffffffc0803a2de0: d503233f paciasp
> ffffffc0803a2de4: a9bf7bfd stp x29, x30, [sp, #-0x10]!
> ffffffc0803a2de8: 910003fd mov x29, sp
> ffffffc0803a2dec: 94000004 bl 0xffffffc0803a2dfc <__alloc_tagging_slab_alloc_hook>
> ffffffc0803a2df0: a8c17bfd ldp x29, x30, [sp], #0x10
> ffffffc0803a2df4: d50323bf autiasp
> ffffffc0803a2df8: d65f03c0 ret
>
> This is the same function with the inlined version of
> __alloc_tagging_slab_alloc_hook():
>
> ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
> ffffffc0803a2dd8: d503233f paciasp
> ffffffc0803a2ddc: d10103ff sub sp, sp, #0x40
> ffffffc0803a2de0: a9017bfd stp x29, x30, [sp, #0x10]
> ffffffc0803a2de4: f90013f5 str x21, [sp, #0x20]
> ffffffc0803a2de8: a9034ff4 stp x20, x19, [sp, #0x30]
> ffffffc0803a2dec: 910043fd add x29, sp, #0x10
> ffffffc0803a2df0: d503201f nop
> ffffffc0803a2df4: a9434ff4 ldp x20, x19, [sp, #0x30]
> ffffffc0803a2df8: f94013f5 ldr x21, [sp, #0x20]
> ffffffc0803a2dfc: a9417bfd ldp x29, x30, [sp, #0x10]
> ffffffc0803a2e00: 910103ff add sp, sp, #0x40
> ffffffc0803a2e04: d50323bf autiasp
> ffffffc0803a2e08: d65f03c0 ret

It seems to me this will also return unless the nop is rewritten, but
instead of making a call reachable, the patched nop will jump to the block
below? Now, is the overhead larger because the code below gets cached, or
because the block above does more work (setting up a stack frame) in the
disabled case? It looks quite suboptimal.

> ffffffc0803a2e0c: b4ffff41 cbz x1, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
> ffffffc0803a2e10: b9400808 ldr w8, [x0, #0x8]
> ffffffc0803a2e14: 12060049 and w9, w2, #0x4000000
> ffffffc0803a2e18: 12152108 and w8, w8, #0xff800
> ffffffc0803a2e1c: 120d6108 and w8, w8, #0xfff80fff
> ffffffc0803a2e20: 2a090108 orr w8, w8, w9
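Regarding the force-inlining suggestion above, what I have in mind is
roughly the following (again just a sketch reusing the made-up names from
the snippet earlier in this mail, not the real signatures):

	/* Sketch only; the real hook's signature differs. Forcing the
	 * wrapper inline leaves only the patchable nop at each call site,
	 * so the disabled case no longer pays for a call to a
	 * two-instruction function. */
	static __always_inline void my_feature_hook(void *obj)
	{
		if (static_branch_unlikely(&my_feature_key))
			my_feature_slowpath(obj);	/* still noinline, out of line */
	}

The heavy body stays out of line either way; only the nop gets duplicated
into the callers.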