On 1/29/25 03:54, Suren Baghdasaryan wrote:
> On Tue, Jan 28, 2025 at 3:43 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>>
>> On Tue, Jan 28, 2025 at 11:35 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> >
>> > On Mon, 27 Jan 2025 11:38:32 -0800
>> > Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>> >
>> > > On Sun, Jan 26, 2025 at 8:47 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>> > > >
>> > > > On 1/26/25 08:02, Suren Baghdasaryan wrote:
>> > > > > When a sizable code section is protected by a disabled static key, that
>> > > > > code gets into the instruction cache even though it's not executed and
>> > > > > consumes the cache, increasing cache misses. This can be remedied by
>> > > > > moving such code into a separate uninlined function. The improvement
>> > >
>> > > Sorry, I missed adding Steven Rostedt to the CC list since his
>> > > advice was instrumental in finding the way to optimize the static key
>> > > performance in this patch. Added now.
>> > >
>> > > >
>> > > > Weird, I thought static_branch_likely/unlikely/maybe was already
>> > > > handling this by the unlikely case being a jump to a block away from the
>> > > > fast-path stream of instructions, thus making it less likely to get cached.
>> > > > AFAIU even plain likely()/unlikely() should do this, along with branch
>> > > > prediction hints.
>> > >
>> > > This was indeed an unexpected overhead when I measured it on Android.
>> > > Cache pollution was my understanding of the cause of this high
>> > > overhead after Steven told me to try uninlining the protected code. He
>> > > has done something similar in the tracing subsystem. But maybe I
>> > > misunderstood the real reason. Steven, could you please verify whether
>> > > my understanding of the cause of the high overhead is correct here?
>> > > Maybe there is something else at play that I missed?
>> >
>> > From what I understand, the compiler will only move code to the end
>> > of a function with unlikely(). But the code after the function could
>> > also be in the control flow path. If you have several functions that are
>> > called together, adding code to the unlikely() cases may not help the
>> > speed.
>> >
>> > I made an effort to make the tracepoint code call functions instead of
>> > having everything inlined. It actually brought down the size of the text of
>> > the kernel, but looking in the change logs I never posted benchmarks. But
>> > I'm sure making the scheduler's text section smaller probably did
>> > help.
>> >
>> > > > That would be in line with my understanding above. Does the arm64 compiler
>> > > > not do it as well as x86 (which could maybe be found out by disassembling),
>> > > > or does the Pixel 6 CPU somehow cache these out-of-line blocks more
>> > > > aggressively, so that only a function call stops it?
>> > >
>> > > I'll disassemble the code and see what it looks like.
>> >
>> > I think I asked you to do that too ;-)
>>
>> Yes you did! And I disassembled almost every one of these functions during
>> my investigation, but in my infinite wisdom I did not save any of them.
>> So now I need to do that again to answer Vlastimil's question. I'll
>> try to do that today.
>
> Yeah, quite a difference. This is alloc_tagging_slab_alloc_hook() with
> the outlined version of __alloc_tagging_slab_alloc_hook():

Not fluent in arm64 assembly, but let's see...

> ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
> ffffffc0803a2dd8: d503201f nop
> ffffffc0803a2ddc: d65f03c0 ret

So that's an immediate return unless the static key rewrites the nop.
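For readers following along, the shape being discussed is roughly this (a
simplified sketch with made-up names, not the actual alloc tagging code):
the sizable body is moved into its own uninlined function, and the static
key test compiles down to a patchable nop in the caller.

	#include <linux/jump_label.h>

	/* Illustrative names only -- not the actual alloc tagging code. */
	DEFINE_STATIC_KEY_FALSE(my_feature_key);

	/* The sizable body lives in its own uninlined function, so its
	 * instructions stay out of the callers' icache when the key is off. */
	static noinline void my_feature_slowpath(void *obj)
	{
		/* ... the expensive accounting work ... */
	}

	static inline void my_feature_hook(void *obj)
	{
		/* Compiles to a single patchable nop while the key is disabled;
		 * it is patched into a branch when the key is enabled at runtime. */
		if (static_branch_unlikely(&my_feature_key))
			my_feature_slowpath(obj);
	}

With the key disabled, that is exactly the nop (plus the wrapper's ret) we
see above.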
BTW, I wouldn't expect alloc_tagging_slab_alloc_hook() to exist as a
separate function in the first place, since it's "static inline". It seems
weird to do a function call just to perform a static key test. We should
perhaps force-inline it (see the sketch at the end of this mail).

> ffffffc0803a2de0: d503233f paciasp
> ffffffc0803a2de4: a9bf7bfd stp x29, x30, [sp, #-0x10]!
> ffffffc0803a2de8: 910003fd mov x29, sp
> ffffffc0803a2dec: 94000004 bl 0xffffffc0803a2dfc <__alloc_tagging_slab_alloc_hook>
> ffffffc0803a2df0: a8c17bfd ldp x29, x30, [sp], #0x10
> ffffffc0803a2df4: d50323bf autiasp
> ffffffc0803a2df8: d65f03c0 ret
>
> This is the same function with the inlined version of
> __alloc_tagging_slab_alloc_hook():
>
> ffffffc0803a2dd8 <alloc_tagging_slab_alloc_hook>:
> ffffffc0803a2dd8: d503233f paciasp
> ffffffc0803a2ddc: d10103ff sub sp, sp, #0x40
> ffffffc0803a2de0: a9017bfd stp x29, x30, [sp, #0x10]
> ffffffc0803a2de4: f90013f5 str x21, [sp, #0x20]
> ffffffc0803a2de8: a9034ff4 stp x20, x19, [sp, #0x30]
> ffffffc0803a2dec: 910043fd add x29, sp, #0x10
> ffffffc0803a2df0: d503201f nop
> ffffffc0803a2df4: a9434ff4 ldp x20, x19, [sp, #0x30]
> ffffffc0803a2df8: f94013f5 ldr x21, [sp, #0x20]
> ffffffc0803a2dfc: a9417bfd ldp x29, x30, [sp, #0x10]
> ffffffc0803a2e00: 910103ff add sp, sp, #0x40
> ffffffc0803a2e04: d50323bf autiasp
> ffffffc0803a2e08: d65f03c0 ret

It seems to me this will also return unless the nop is rewritten, but
instead of making a call reachable, the patched nop will jump to the block
below? Now, is the overhead larger because the code below gets cached, or
because the block above does more work (setting up a stack frame) in the
disabled case? It looks quite suboptimal.

> ffffffc0803a2e0c: b4ffff41 cbz x1, 0xffffffc0803a2df4 <alloc_tagging_slab_alloc_hook+0x1c>
> ffffffc0803a2e10: b9400808 ldr w8, [x0, #0x8]
> ffffffc0803a2e14: 12060049 and w9, w2, #0x4000000
> ffffffc0803a2e18: 12152108 and w8, w8, #0xff800
> ffffffc0803a2e1c: 120d6108 and w8, w8, #0xfff80fff
> ffffffc0803a2e20: 2a090108 orr w8, w8, w9
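Regarding the force-inlining suggestion above, what I have in mind is
roughly the following (again just a sketch reusing the made-up names from
the snippet earlier in this mail, not the real signatures):

	/* Sketch only; the real hook's signature differs. Forcing the
	 * wrapper inline leaves only the patchable nop at each call site,
	 * so the disabled case no longer pays for a call to a
	 * two-instruction function. */
	static __always_inline void my_feature_hook(void *obj)
	{
		if (static_branch_unlikely(&my_feature_key))
			my_feature_slowpath(obj);	/* still noinline, out of line */
	}

The heavy body stays out of line either way; only the nop gets duplicated
into the callers.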