On Tue, Jan 28, 2025 at 3:43 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Tue, Jan 28, 2025 at 11:35 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > On Mon, 27 Jan 2025 11:38:32 -0800
> > Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > > On Sun, Jan 26, 2025 at 8:47 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> > > >
> > > > On 1/26/25 08:02, Suren Baghdasaryan wrote:
> > > > > When a sizable code section is protected by a disabled static key, that
> > > > > code gets into the instruction cache even though it's not executed and
> > > > > consumes the cache, increasing cache misses. This can be remedied by
> > > > > moving such code into a separate uninlined function. The improvement
> > >
> > > Sorry, I missed adding Steven Rostedt into the CC list since his
> > > advice was instrumental in finding the way to optimize the static key
> > > performance in this patch. Added now.
> > >
> > > >
> > > > Weird, I thought the static_branch_likely/unlikely/maybe was already
> > > > handling this by the unlikely case being a jump to a block away from the
> > > > fast-path stream of instructions, thus making it less likely to get cached.
> > > > AFAIU even plain likely()/unlikely() should do this, along with branch
> > > > prediction hints.
> > >
> > > This was indeed an unexpected overhead when I measured it on Android.
> > > Cache pollution was my understanding of the cause for this high
> > > overhead after Steven told me to try uninlining the protected code. He
> > > has done something similar in the tracing subsystem. But maybe I
> > > misunderstood the real reason. Steven, could you please verify if my
> > > understanding of the high overhead cause is correct here? Maybe there
> > > is something else at play that I missed?
> >
> > From what I understand, the compiler will only move code to the end
> > of a function with the unlikely(). But, the code after the function could
> > also be in the control flow path.
> > If you have several functions that are
> > called together, adding code to the unlikely() cases may not help the
> > speed.
> >
> > I made an effort to make the tracepoint code call functions instead of
> > having everything inlined. It actually brought down the size of the text of
> > the kernel, but looking in the change logs I never posted benchmarks. But
> > I'm sure making the size of the scheduler text section smaller probably did
> > help.
> >
> > > > That would be in line with my understanding above. Does the arm64 compiler
> > > > not do it as well as x86 (could be maybe found out by disassembling) or the
> > > > Pixel6 cpu somehow caches these out of line blocks more aggressively and only
> > > > a function call stops it?
> > >
> > > I'll disassemble the code and will see what it looks like.
> >
> > I think I asked you to do that too ;-)
>
> Yes you did! And I disassembled almost each of these functions during
> my investigation but in my infinite wisdom I did not save any of them.
> So, now I need to do that again to answer Vlastimil's question. I'll
> try to do that today.

Yeah, quite a difference. This is alloc_tagging_slab_alloc_hook() with
outlined version of __alloc_tagging_slab_alloc_hook():

ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
ffffffc0803a2dd8: d503201f nop
ffffffc0803a2ddc: d65f03c0 ret
ffffffc0803a2de0: d503233f paciasp
ffffffc0803a2de4: a9bf7bfd stp x29, x30, [sp, #-0x10]!
ffffffc0803a2de8: 910003fd mov x29, sp
ffffffc0803a2dec: 94000004 bl 0xffffffc0803a2dfc <__alloc_tagging_slab_alloc_hook>
ffffffc0803a2df0: a8c17bfd ldp x29, x30, [sp], #0x10
ffffffc0803a2df4: d50323bf autiasp
ffffffc0803a2df8: d65f03c0 ret

This is the same function with inlined version of
__alloc_tagging_slab_alloc_hook():

ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
ffffffc0803a2dd8: d503233f paciasp
ffffffc0803a2ddc: d10103ff sub sp, sp, #0x40
ffffffc0803a2de0: a9017bfd stp x29, x30, [sp, #0x10]
ffffffc0803a2de4: f90013f5 str x21, [sp, #0x20]
ffffffc0803a2de8: a9034ff4 stp x20, x19, [sp, #0x30]
ffffffc0803a2dec: 910043fd add x29, sp, #0x10
ffffffc0803a2df0: d503201f nop
ffffffc0803a2df4: a9434ff4 ldp x20, x19, [sp, #0x30]
ffffffc0803a2df8: f94013f5 ldr x21, [sp, #0x20]
ffffffc0803a2dfc: a9417bfd ldp x29, x30, [sp, #0x10]
ffffffc0803a2e00: 910103ff add sp, sp, #0x40
ffffffc0803a2e04: d50323bf autiasp
ffffffc0803a2e08: d65f03c0 ret
ffffffc0803a2e0c: b4ffff41 cbz x1, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2e10: b9400808 ldr w8, [x0, #0x8]
ffffffc0803a2e14: 12060049 and w9, w2, #0x4000000
ffffffc0803a2e18: 12152108 and w8, w8, #0xff800
ffffffc0803a2e1c: 120d6108 and w8, w8, #0xfff80fff
ffffffc0803a2e20: 2a090108 orr w8, w8, w9
ffffffc0803a2e24: 35fffe88 cbnz w8, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2e28: d378dc28 lsl x8, x1, #8
ffffffc0803a2e2c: d2c01009 mov x9, #0x8000000000 // =549755813888
ffffffc0803a2e30: f9000fa0 str x0, [x29, #0x18]
ffffffc0803a2e34: f90007e1 str x1, [sp, #0x8]
ffffffc0803a2e38: 8b882128 add x8, x9, x8, asr #8
ffffffc0803a2e3c: b25f7be9 mov x9, #-0x200000000 // =-8589934592
ffffffc0803a2e40: f2b80009 movk x9, #0xc000, lsl #16
ffffffc0803a2e44: d34cfd08 lsr x8, x8, #12
ffffffc0803a2e48: 8b081928 add x8, x9, x8, lsl #6
ffffffc0803a2e4c: f9400509 ldr x9, [x8, #0x8]
ffffffc0803a2e50: d100052a sub x10, x9, #0x1
ffffffc0803a2e54: 7200013f tst w9, #0x1
ffffffc0803a2e58: 9a8a0108 csel x8, x8, x10, eq
ffffffc0803a2e5c: 3940cd09 ldrb w9, [x8, #0x33]
ffffffc0803a2e60: 7103d53f cmp w9, #0xf5
ffffffc0803a2e64: 9a9f0113 csel x19, x8, xzr, eq
ffffffc0803a2e68: f9401e68 ldr x8, [x19, #0x38]
ffffffc0803a2e6c: f1001d1f cmp x8, #0x7
ffffffc0803a2e70: 540000a8 b.hi 0xffffffc0803a2e84 <alloc_tagging_slab_alloc_hook+0xac>
ffffffc0803a2e74: aa1303e0 mov x0, x19
ffffffc0803a2e78: 2a1f03e3 mov w3, wzr
ffffffc0803a2e7c: 97ffd6a5 bl 0xffffffc080398910 <alloc_slab_obj_exts>
ffffffc0803a2e80: 350009c0 cbnz w0, 0xffffffc0803a2fb8 <alloc_tagging_slab_alloc_hook+0x1e0>
ffffffc0803a2e84: b000f2c8 adrp x8, 0xffffffc0821fb000 <max_load_balance_interval>
ffffffc0803a2e88: f9401e6a ldr x10, [x19, #0x38]
ffffffc0803a2e8c: f9453909 ldr x9, [x8, #0xa70]
ffffffc0803a2e90: 927df148 and x8, x10, #0xfffffffffffffff8
ffffffc0803a2e94: b40000e9 cbz x9, 0xffffffc0803a2eb0 <alloc_tagging_slab_alloc_hook+0xd8>
ffffffc0803a2e98: f94007ea ldr x10, [sp, #0x8]
ffffffc0803a2e9c: cb090149 sub x9, x10, x9
ffffffc0803a2ea0: f142013f cmp x9, #0x80, lsl #12 // =0x80000
ffffffc0803a2ea4: 54000062 b.hs 0xffffffc0803a2eb0 <alloc_tagging_slab_alloc_hook+0xd8>
ffffffc0803a2ea8: aa1f03e9 mov x9, xzr
ffffffc0803a2eac: 14000015 b 0xffffffc0803a2f00 <alloc_tagging_slab_alloc_hook+0x128>
ffffffc0803a2eb0: d2ffe009 mov x9, #-0x100000000000000 // =-72057594037927936
ffffffc0803a2eb4: 14000002 b 0xffffffc0803a2ebc <alloc_tagging_slab_alloc_hook+0xe4>
ffffffc0803a2eb8: aa1f03e9 mov x9, xzr
ffffffc0803a2ebc: d2dffa0a mov x10, #0xffd000000000 // =281268818280448
ffffffc0803a2ec0: f2e01fea movk x10, #0xff, lsl #48
ffffffc0803a2ec4: 8b13194a add x10, x10, x19, lsl #6
ffffffc0803a2ec8: 9274ad4a and x10, x10, #0xfffffffffff000
ffffffc0803a2ecc: aa0a012a orr x10, x9, x10
ffffffc0803a2ed0: f9400fa9 ldr x9, [x29, #0x18]
ffffffc0803a2ed4: f940112b ldr x11, [x9, #0x20]
ffffffc0803a2ed8: f94007e9 ldr x9, [sp, #0x8]
ffffffc0803a2edc: cb0a0129 sub x9, x9, x10
ffffffc0803a2ee0: d360fd6c lsr x12, x11, #32
ffffffc0803a2ee4: 9bab7d2a umull x10, w9, w11
ffffffc0803a2ee8: d368fd6b lsr x11, x11, #40
ffffffc0803a2eec: d360fd4a lsr x10, x10, #32
ffffffc0803a2ef0: 4b0a0129 sub w9, w9, w10
ffffffc0803a2ef4: 1acc2529 lsr w9, w9, w12
ffffffc0803a2ef8: 0b0a0129 add w9, w9, w10
ffffffc0803a2efc: 1acb2529 lsr w9, w9, w11
ffffffc0803a2f00: ab091109 adds x9, x8, x9, lsl #4
ffffffc0803a2f04: f9400fa8 ldr x8, [x29, #0x18]
ffffffc0803a2f08: 54fff760 b.eq 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2f0c: b1002129 adds x9, x9, #0x8
ffffffc0803a2f10: 54fff720 b.eq 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2f14: d5384113 mrs x19, SP_EL0
ffffffc0803a2f18: f9402a74 ldr x20, [x19, #0x50]
ffffffc0803a2f1c: b4fff6d4 cbz x20, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2f20: b9401915 ldr w21, [x8, #0x18]
ffffffc0803a2f24: f9000134 str x20, [x9]
ffffffc0803a2f28: b9401268 ldr w8, [x19, #0x10]
ffffffc0803a2f2c: 11000508 add w8, w8, #0x1
ffffffc0803a2f30: b9001268 str w8, [x19, #0x10]
ffffffc0803a2f34: f9401288 ldr x8, [x20, #0x20]
ffffffc0803a2f38: d538d089 mrs x9, TPIDR_EL1
ffffffc0803a2f3c: 8b090108 add x8, x8, x9
ffffffc0803a2f40: 52800029 mov w9, #0x1 // =1
ffffffc0803a2f44: 91002108 add x8, x8, #0x8
ffffffc0803a2f48: c85f7d0b ldxr x11, [x8]
ffffffc0803a2f4c: 8b09016b add x11, x11, x9
ffffffc0803a2f50: c80a7d0b stxr w10, x11, [x8]
ffffffc0803a2f54: 35ffffaa cbnz w10, 0xffffffc0803a2f48 <alloc_tagging_slab_alloc_hook+0x170>
ffffffc0803a2f58: f9400a68 ldr x8, [x19, #0x10]
ffffffc0803a2f5c: f1000508 subs x8, x8, #0x1
ffffffc0803a2f60: b9001268 str w8, [x19, #0x10]
ffffffc0803a2f64: 540003c0 b.eq 0xffffffc0803a2fdc <alloc_tagging_slab_alloc_hook+0x204>
ffffffc0803a2f68: f9400a68 ldr x8, [x19, #0x10]
ffffffc0803a2f6c: b4000388 cbz x8, 0xffffffc0803a2fdc <alloc_tagging_slab_alloc_hook+0x204>
ffffffc0803a2f70: b9401268 ldr w8, [x19, #0x10]
ffffffc0803a2f74: 11000508 add w8, w8, #0x1
ffffffc0803a2f78: b9001268 str w8, [x19, #0x10]
ffffffc0803a2f7c: f9401288 ldr x8, [x20, #0x20]
ffffffc0803a2f80: d538d089 mrs x9, TPIDR_EL1
ffffffc0803a2f84: 8b080128 add x8, x9, x8
ffffffc0803a2f88: c85f7d0a ldxr x10, [x8]
ffffffc0803a2f8c: 8b15014a add x10, x10, x21
ffffffc0803a2f90: c8097d0a stxr w9, x10, [x8]
ffffffc0803a2f94: 35ffffa9 cbnz w9, 0xffffffc0803a2f88 <alloc_tagging_slab_alloc_hook+0x1b0>
ffffffc0803a2f98: f9400a68 ldr x8, [x19, #0x10]
ffffffc0803a2f9c: f1000508 subs x8, x8, #0x1
ffffffc0803a2fa0: b9001268 str w8, [x19, #0x10]
ffffffc0803a2fa4: 54000060 b.eq 0xffffffc0803a2fb0 <alloc_tagging_slab_alloc_hook+0x1d8>
ffffffc0803a2fa8: f9400a68 ldr x8, [x19, #0x10]
ffffffc0803a2fac: b5fff248 cbnz x8, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2fb0: 94344478 bl 0xffffffc0810b4190 <preempt_schedule_notrace>
ffffffc0803a2fb4: 17ffff90 b 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2fb8: f9400fa8 ldr x8, [x29, #0x18]
ffffffc0803a2fbc: f00092c0 adrp x0, 0xffffffc0815fd000 <f_midi_shortname+0x4cf4>
ffffffc0803a2fc0: 910e5400 add x0, x0, #0x395
ffffffc0803a2fc4: d00099c1 adrp x1, 0xffffffc0816dc000 <longname+0x2727d>
ffffffc0803a2fc8: 911d1421 add x1, x1, #0x745
ffffffc0803a2fcc: f9403102 ldr x2, [x8, #0x60]
ffffffc0803a2fd0: 97f46d47 bl 0xffffffc0800be4ec <__warn_printk>
ffffffc0803a2fd4: d4210000 brk #0x800
ffffffc0803a2fd8: 17ffff87 b 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
ffffffc0803a2fdc: 9434446d bl 0xffffffc0810b4190 <preempt_schedule_notrace>
ffffffc0803a2fe0: 17ffffe4 b 0xffffffc0803a2f70 <alloc_tagging_slab_alloc_hook+0x198>

>
> >
> > >
> > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > >
> > > > Kinda sad that despite the static key we have to control a lot by the
> > > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT in addition.
> > >
> > > I agree. If there is a better way to fix this regression I'm open to
> > > changes. Let's wait for Steven to confirm my understanding before
> > > proceeding.
> >
> > How slow is it to always do the call instead of inlining?
>
> Let's see... The additional overhead if we always call is:
>
> Little core: 2.42%
> Middle core: 1.23%
> Big core: 0.66%
>
> Not a huge deal because the overhead of memory profiling when enabled
> is much higher. So, maybe for simplicity I should indeed always call?
>
> >
> > -- Steve
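
For readers following the thread, the shape of the change being discussed (a cheap test in the fast path, with the guarded body moved into a separate uninlined function) can be sketched in plain userspace C. This is only an illustration under assumptions, not the patch itself: profiling_enabled is a hypothetical stand-in for the static key (a real static key patches a nop/branch in place rather than loading a variable), and slow_alloc_hook(), alloc_hook(), demo() and the two counters are made-up names. The noinline and cold attributes and __builtin_expect() are GCC/Clang extensions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the static key. */
static bool profiling_enabled;

/* Hypothetical counters standing in for the per-tag accounting. */
static unsigned long tagged_calls;
static unsigned long tagged_bytes;

/*
 * Slow path kept out of line, so its body is not laid out inside the
 * caller's instruction stream.
 */
static __attribute__((noinline, cold)) void slow_alloc_hook(size_t size)
{
	tagged_calls++;
	tagged_bytes += size;
}

/* Fast path: one test, then return immediately when profiling is off. */
static inline void alloc_hook(size_t size)
{
	if (__builtin_expect(profiling_enabled, 0))
		slow_alloc_hook(size);
}

/* Small usage example exercising both states of the flag. */
static void demo(void)
{
	alloc_hook(64);			/* flag off: slow path is skipped */
	assert(tagged_calls == 0);

	profiling_enabled = true;
	alloc_hook(64);			/* flag on: slow path runs */
	alloc_hook(32);
	assert(tagged_calls == 2 && tagged_bytes == 96);
}
```

With the flag off, alloc_hook() costs a test and a return, which mirrors the nop/ret prologue in the outlined disassembly above; the price is a real function call whenever the flag is on.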