Re: [PATCH 2/3] alloc_tag: uninline code gated by mem_alloc_profiling_key in slab allocator

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Tue, 28 Jan 2025 18:54:42 -0800

On Tue, Jan 28, 2025 at 3:43 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Tue, Jan 28, 2025 at 11:35 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > On Mon, 27 Jan 2025 11:38:32 -0800
> > Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > > On Sun, Jan 26, 2025 at 8:47 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> > > >
> > > > On 1/26/25 08:02, Suren Baghdasaryan wrote:
> > > > > When a sizable code section is protected by a disabled static key, that
> > > > > code gets into the instruction cache even though it's not executed and
> > > > > consumes the cache, increasing cache misses. This can be remedied by
> > > > > moving such code into a separate uninlined function. The improvement
> > >
> > > Sorry, I missed adding Steven Rostedt into the CC list since his
> > > advice was instrumental in finding the way to optimize the static key
> > > performance in this patch. Added now.
> > >
> > > >
> > > > Weird, I thought the static_branch_likely/unlikely/maybe was already
> > > > handling this by the unlikely case being a jump to a block away from the
> > > > fast-path stream of instructions, thus making it less likely to get cached.
> > > > AFAIU even plain likely()/unlikely() should do this, along with branch
> > > > prediction hints.
> > >
> > > This was indeed an unexpected overhead when I measured it on Android.
> > > Cache pollution was my understanding of the cause for this high
> > > overhead after Steven told me to try uninlining the protected code. He
> > > has done something similar in the tracing subsystem. But maybe I
> > > misunderstood the real reason. Steven, could you please verify if my
> > > understanding of the high overhead cause is correct here? Maybe there
> > > is something else at play that I missed?
> >
> > From what I understand, the compiler will only move code to the end
> > of a function with the unlikely(). But the code after the function could
> > also be in the control flow path. If you have several functions that are
> > called together, adding code to the unlikely() cases may not help the
> > speed.
> >
> > I made an effort to make the tracepoint code call functions instead of
> > having everything inlined. It actually brought down the size of the text of
> > the kernel, but looking in the change logs I never posted benchmarks. But
> > I'm sure making the size of the scheduler text section smaller probably did
> > help.
> >
> > > > That would be in line with my understanding above. Does the arm64 compiler
> > > > not do it as well as x86 (could be maybe found out by disassembling) or the
> > > > Pixel6 cpu somehow caches these out of line blocks more aggressively and only
> > > > a function call stops it?
> > >
> > > I'll disassemble the code and will see what it looks like.
> >
> > I think I asked you to do that too ;-)
>
> Yes you did! And I disassembled almost every one of these functions during
> my investigation but in my infinite wisdom I did not save any of them.
> So, now I need to do that again to answer Vlastimil's question. I'll
> try to do that today.

Yeah, quite a difference. This is alloc_tagging_slab_alloc_hook() with
the outlined version of __alloc_tagging_slab_alloc_hook():

ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
ffffffc0803a2dd8: d503201f      nop
ffffffc0803a2ddc: d65f03c0      ret
ffffffc0803a2de0: d503233f      paciasp
ffffffc0803a2de4: a9bf7bfd      stp x29, x30, [sp, #-0x10]!
ffffffc0803a2de8: 910003fd      mov x29, sp
ffffffc0803a2dec: 94000004      bl 0xffffffc0803a2dfc <__alloc_tagging_slab_alloc_hook>
ffffffc0803a2df0: a8c17bfd      ldp x29, x30, [sp], #0x10
ffffffc0803a2df4: d50323bf      autiasp
ffffffc0803a2df8: d65f03c0      ret
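
For readers who want the source-level shape behind the listing above, here
is a minimal userspace sketch of the pattern. It is an assumption-laden
illustration, not the patch itself: a plain bool stands in for the static
key (a real static key patches a nop/branch in place rather than testing a
variable), and every name in it (profiling_enabled, tagged_bytes,
slow_alloc_hook, alloc_hook) is made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mem_alloc_profiling_key; NOT a real static key. */
static bool profiling_enabled;

/* Hypothetical counter so the slow path has an observable effect. */
static size_t tagged_bytes;

/*
 * Slow path, kept out of line. noinline stops the compiler from folding
 * this body (and its register saves) back into every caller.
 */
static __attribute__((noinline)) void slow_alloc_hook(size_t size)
{
	tagged_bytes += size;
}

/*
 * Fast path: one cheap, unlikely check. When profiling is off the caller
 * only pays for the check; the accounting code lives elsewhere in the text.
 */
static inline void alloc_hook(size_t size)
{
	if (__builtin_expect(profiling_enabled, 0))
		slow_alloc_hook(size);
}
```

Calling alloc_hook(64) while profiling_enabled is false leaves tagged_bytes
untouched; after setting profiling_enabled to true, the same call adds 64.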

This is the same function with the inlined version of
__alloc_tagging_slab_alloc_hook():

ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
ffffffc0803a2dd8: d503233f      paciasp
ffffffc0803a2ddc: d10103ff      sub sp, sp, #0x40
ffffffc0803a2de0: a9017bfd      stp x29, x30, [sp, #0x10]
ffffffc0803a2de4: f90013f5      str x21, [sp, #0x20]
ffffffc0803a2de8: a9034ff4      stp x20, x19, [sp, #0x30]
ffffffc0803a2dec: 910043fd      add x29, sp, #0x10
ffffffc0803a2df0: d503201f      nop
ffffffc0803a2df4: a9434ff4      ldp x20, x19, [sp, #0x30]
ffffffc0803a2df8: f94013f5      ldr x21, [sp, #0x20]
ffffffc0803a2dfc: a9417bfd      ldp x29, x30, [sp, #0x10]
ffffffc0803a2e00: 910103ff      add sp, sp, #0x40
ffffffc0803a2e04: d50323bf      autiasp
ffffffc0803a2e08: d65f03c0      ret
ffffffc0803a2e0c: b4ffff41      cbz x1, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2e10: b9400808      ldr w8, [x0, #0x8]
ffffffc0803a2e14: 12060049      and w9, w2, #0x4000000
ffffffc0803a2e18: 12152108      and w8, w8, #0xff800
ffffffc0803a2e1c: 120d6108      and w8, w8, #0xfff80fff
ffffffc0803a2e20: 2a090108      orr w8, w8, w9
ffffffc0803a2e24: 35fffe88      cbnz w8, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2e28: d378dc28      lsl x8, x1, #8
ffffffc0803a2e2c: d2c01009      mov x9, #0x8000000000 // =549755813888
ffffffc0803a2e30: f9000fa0      str x0, [x29, #0x18]
ffffffc0803a2e34: f90007e1      str x1, [sp, #0x8]
ffffffc0803a2e38: 8b882128      add x8, x9, x8, asr #8
ffffffc0803a2e3c: b25f7be9      mov x9, #-0x200000000 // =-8589934592
ffffffc0803a2e40: f2b80009      movk x9, #0xc000, lsl #16
ffffffc0803a2e44: d34cfd08      lsr x8, x8, #12
ffffffc0803a2e48: 8b081928      add x8, x9, x8, lsl #6
ffffffc0803a2e4c: f9400509      ldr x9, [x8, #0x8]
ffffffc0803a2e50: d100052a      sub x10, x9, #0x1
ffffffc0803a2e54: 7200013f      tst w9, #0x1
ffffffc0803a2e58: 9a8a0108      csel x8, x8, x10, eq
ffffffc0803a2e5c: 3940cd09      ldrb w9, [x8, #0x33]
ffffffc0803a2e60: 7103d53f      cmp w9, #0xf5
ffffffc0803a2e64: 9a9f0113      csel x19, x8, xzr, eq
ffffffc0803a2e68: f9401e68      ldr x8, [x19, #0x38]
ffffffc0803a2e6c: f1001d1f      cmp x8, #0x7
ffffffc0803a2e70: 540000a8      b.hi 0xffffffc0803a2e84 <alloc_tagging_slab_alloc_hook+0xac>
ffffffc0803a2e74: aa1303e0      mov x0, x19
ffffffc0803a2e78: 2a1f03e3      mov w3, wzr
ffffffc0803a2e7c: 97ffd6a5      bl 0xffffffc080398910 <alloc_slab_obj_exts>
ffffffc0803a2e80: 350009c0      cbnz w0, 0xffffffc0803a2fb8 <alloc_tagging_slab_alloc_hook+0x1e0>
ffffffc0803a2e84: b000f2c8      adrp x8, 0xffffffc0821fb000 <max_load_balance_interval>
ffffffc0803a2e88: f9401e6a      ldr x10, [x19, #0x38]
ffffffc0803a2e8c: f9453909      ldr x9, [x8, #0xa70]
ffffffc0803a2e90: 927df148      and x8, x10, #0xfffffffffffffff8
ffffffc0803a2e94: b40000e9      cbz x9, 0xffffffc0803a2eb0 <alloc_tagging_slab_alloc_hook+0xd8>
ffffffc0803a2e98: f94007ea      ldr x10, [sp, #0x8]
ffffffc0803a2e9c: cb090149      sub x9, x10, x9
ffffffc0803a2ea0: f142013f      cmp x9, #0x80, lsl #12 // =0x80000
ffffffc0803a2ea4: 54000062      b.hs 0xffffffc0803a2eb0 <alloc_tagging_slab_alloc_hook+0xd8>
ffffffc0803a2ea8: aa1f03e9      mov x9, xzr
ffffffc0803a2eac: 14000015      b 0xffffffc0803a2f00 <alloc_tagging_slab_alloc_hook+0x128>
ffffffc0803a2eb0: d2ffe009      mov x9, #-0x100000000000000 // =-72057594037927936
ffffffc0803a2eb4: 14000002      b 0xffffffc0803a2ebc <alloc_tagging_slab_alloc_hook+0xe4>
ffffffc0803a2eb8: aa1f03e9      mov x9, xzr
ffffffc0803a2ebc: d2dffa0a      mov x10, #0xffd000000000 // =281268818280448
ffffffc0803a2ec0: f2e01fea      movk x10, #0xff, lsl #48
ffffffc0803a2ec4: 8b13194a      add x10, x10, x19, lsl #6
ffffffc0803a2ec8: 9274ad4a      and x10, x10, #0xfffffffffff000
ffffffc0803a2ecc: aa0a012a      orr x10, x9, x10
ffffffc0803a2ed0: f9400fa9      ldr x9, [x29, #0x18]
ffffffc0803a2ed4: f940112b      ldr x11, [x9, #0x20]
ffffffc0803a2ed8: f94007e9      ldr x9, [sp, #0x8]
ffffffc0803a2edc: cb0a0129      sub x9, x9, x10
ffffffc0803a2ee0: d360fd6c      lsr x12, x11, #32
ffffffc0803a2ee4: 9bab7d2a      umull x10, w9, w11
ffffffc0803a2ee8: d368fd6b      lsr x11, x11, #40
ffffffc0803a2eec: d360fd4a      lsr x10, x10, #32
ffffffc0803a2ef0: 4b0a0129      sub w9, w9, w10
ffffffc0803a2ef4: 1acc2529      lsr w9, w9, w12
ffffffc0803a2ef8: 0b0a0129      add w9, w9, w10
ffffffc0803a2efc: 1acb2529      lsr w9, w9, w11
ffffffc0803a2f00: ab091109      adds x9, x8, x9, lsl #4
ffffffc0803a2f04: f9400fa8      ldr x8, [x29, #0x18]
ffffffc0803a2f08: 54fff760      b.eq 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2f0c: b1002129      adds x9, x9, #0x8
ffffffc0803a2f10: 54fff720      b.eq 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2f14: d5384113      mrs x19, SP_EL0
ffffffc0803a2f18: f9402a74      ldr x20, [x19, #0x50]
ffffffc0803a2f1c: b4fff6d4      cbz x20, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2f20: b9401915      ldr w21, [x8, #0x18]
ffffffc0803a2f24: f9000134      str x20, [x9]
ffffffc0803a2f28: b9401268      ldr w8, [x19, #0x10]
ffffffc0803a2f2c: 11000508      add w8, w8, #0x1
ffffffc0803a2f30: b9001268      str w8, [x19, #0x10]
ffffffc0803a2f34: f9401288      ldr x8, [x20, #0x20]
ffffffc0803a2f38: d538d089      mrs x9, TPIDR_EL1
ffffffc0803a2f3c: 8b090108      add x8, x8, x9
ffffffc0803a2f40: 52800029      mov w9, #0x1        // =1
ffffffc0803a2f44: 91002108      add x8, x8, #0x8
ffffffc0803a2f48: c85f7d0b      ldxr x11, [x8]
ffffffc0803a2f4c: 8b09016b      add x11, x11, x9
ffffffc0803a2f50: c80a7d0b      stxr w10, x11, [x8]
ffffffc0803a2f54: 35ffffaa      cbnz w10, 0xffffffc0803a2f48 <alloc_tagging_slab_alloc_hook+0x170>
ffffffc0803a2f58: f9400a68      ldr x8, [x19, #0x10]
ffffffc0803a2f5c: f1000508      subs x8, x8, #0x1
ffffffc0803a2f60: b9001268      str w8, [x19, #0x10]
ffffffc0803a2f64: 540003c0      b.eq 0xffffffc0803a2fdc <alloc_tagging_slab_alloc_hook+0x204>
ffffffc0803a2f68: f9400a68      ldr x8, [x19, #0x10]
ffffffc0803a2f6c: b4000388      cbz x8, 0xffffffc0803a2fdc <alloc_tagging_slab_alloc_hook+0x204>
ffffffc0803a2f70: b9401268      ldr w8, [x19, #0x10]
ffffffc0803a2f74: 11000508      add w8, w8, #0x1
ffffffc0803a2f78: b9001268      str w8, [x19, #0x10]
ffffffc0803a2f7c: f9401288      ldr x8, [x20, #0x20]
ffffffc0803a2f80: d538d089      mrs x9, TPIDR_EL1
ffffffc0803a2f84: 8b080128      add x8, x9, x8
ffffffc0803a2f88: c85f7d0a      ldxr x10, [x8]
ffffffc0803a2f8c: 8b15014a      add x10, x10, x21
ffffffc0803a2f90: c8097d0a      stxr w9, x10, [x8]
ffffffc0803a2f94: 35ffffa9      cbnz w9, 0xffffffc0803a2f88 <alloc_tagging_slab_alloc_hook+0x1b0>
ffffffc0803a2f98: f9400a68      ldr x8, [x19, #0x10]
ffffffc0803a2f9c: f1000508      subs x8, x8, #0x1
ffffffc0803a2fa0: b9001268      str w8, [x19, #0x10]
ffffffc0803a2fa4: 54000060      b.eq 0xffffffc0803a2fb0 <alloc_tagging_slab_alloc_hook+0x1d8>
ffffffc0803a2fa8: f9400a68      ldr x8, [x19, #0x10]
ffffffc0803a2fac: b5fff248      cbnz x8, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2fb0: 94344478      bl 0xffffffc0810b4190 <preempt_schedule_notrace>
ffffffc0803a2fb4: 17ffff90      b 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2fb8: f9400fa8      ldr x8, [x29, #0x18]
ffffffc0803a2fbc: f00092c0      adrp x0, 0xffffffc0815fd000 <f_midi_shortname+0x4cf4>
ffffffc0803a2fc0: 910e5400      add x0, x0, #0x395
ffffffc0803a2fc4: d00099c1      adrp x1, 0xffffffc0816dc000 <longname+0x2727d>
ffffffc0803a2fc8: 911d1421      add x1, x1, #0x745
ffffffc0803a2fcc: f9403102      ldr x2, [x8, #0x60]
ffffffc0803a2fd0: 97f46d47      bl 0xffffffc0800be4ec <__warn_printk>
ffffffc0803a2fd4: d4210000      brk #0x800
ffffffc0803a2fd8: 17ffff87      b 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2fdc: 9434446d      bl 0xffffffc0810b4190 <preempt_schedule_notrace>
ffffffc0803a2fe0: 17ffffe4      b 0xffffffc0803a2f70 <alloc_tagging_slab_alloc_hook+0x198>
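
To put rough numbers on the two listings above, the address ranges can be
turned into sizes. The sketch below is only arithmetic on the addresses
shown; the helper names (text_size, insn_count, lines_touched) are made up
for illustration, and the 64-byte cache line is an assumption, not something
measured on the Pixel 6.

```c
#include <stdint.h>

/* Size in bytes of the half-open code range [start, end). */
static uint64_t text_size(uint64_t start, uint64_t end)
{
	return end - start;
}

/* A64 instructions are fixed width, 4 bytes each. */
static uint64_t insn_count(uint64_t start, uint64_t end)
{
	return text_size(start, end) / 4;
}

/*
 * Number of cache lines the range [start, end) overlaps.
 * line is the cache line size in bytes (64 is an ASSUMPTION here).
 */
static uint64_t lines_touched(uint64_t start, uint64_t end, uint64_t line)
{
	return (end - 1) / line - start / line + 1;
}
```

Both listings start at 0xffffffc0803a2dd8. The outlined wrapper ends at
0xffffffc0803a2dfc, which gives 36 bytes, 9 instructions, and a single
64-byte line. The inlined version ends at 0xffffffc0803a2fe4, which gives
524 bytes, 131 instructions, and 9 lines. As I read the listings, the
disabled-key path is also longer in the inlined case: 13 instructions of
frame setup and teardown up to the ret at 0xffffffc0803a2e08 (52 bytes,
straddling 2 lines), versus just nop and ret in the outlined one.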

>
> >
> > >
> > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > >
> > > > Kinda sad that despite the static key we have to control a lot by the
> > > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT in addition.
> > >
> > > I agree. If there is a better way to fix this regression I'm open to
> > > changes. Let's wait for Steven to confirm my understanding before
> > > proceeding.
> >
> > How slow is it to always do the call instead of inlining?
>
> Let's see... The additional overhead if we always call is:
>
> Little core: 2.42%
> Middle core: 1.23%
> Big core: 0.66%
>
> Not a huge deal because the overhead of memory profiling when enabled
> is much higher. So, maybe for simplicity I should indeed always call?
>
> >
> > -- Steve