On Tue, Jan 28, 2025 at 11:35 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Mon, 27 Jan 2025 11:38:32 -0800
> Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> > On Sun, Jan 26, 2025 at 8:47 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> > >
> > > On 1/26/25 08:02, Suren Baghdasaryan wrote:
> > > > When a sizable code section is protected by a disabled static key, that
> > > > code gets into the instruction cache even though it's not executed and
> > > > consumes the cache, increasing cache misses. This can be remedied by
> > > > moving such code into a separate uninlined function. The improvement
> >
> > Sorry, I missed adding Steven Rostedt into the CC list since his
> > advice was instrumental in finding the way to optimize the static key
> > performance in this patch. Added now.
> >
> > >
> > > Weird, I thought the static_branch_likely/unlikely/maybe was already
> > > handling this by the unlikely case being a jump to a block away from the
> > > fast-path stream of instructions, thus making it less likely to get cached.
> > > AFAIU even plain likely()/unlikely() should do this, along with branch
> > > prediction hints.
> >
> > This was indeed an unexpected overhead when I measured it on Android.
> > Cache pollution was my understanding of the cause of this high
> > overhead after Steven told me to try uninlining the protected code. He
> > has done something similar in the tracing subsystem. But maybe I
> > misunderstood the real reason. Steven, could you please verify whether my
> > understanding of the cause of the high overhead is correct here? Maybe
> > there is something else at play that I missed?
>
> From what I understand, the compiler will only move code to the end
> of a function with unlikely(). But the code after the function could
> also be in the control-flow path. If you have several functions that are
> called together, adding code to the unlikely() cases may not help the
> speed.
>
> I made an effort to make the tracepoint code call functions instead of
> having everything inlined. It actually brought down the size of the text of
> the kernel, but looking in the change logs I never posted benchmarks. But
> I'm sure making the size of the scheduler text section smaller probably did
> help.
>
> > > That would be in line with my understanding above. Does the arm64 compiler
> > > not do it as well as x86 (which could maybe be found out by disassembling),
> > > or does the Pixel 6 CPU somehow cache these out-of-line blocks more
> > > aggressively, so that only a function call stops it?
> >
> > I'll disassemble the code and will see what it looks like.
>
> I think I asked you to do that too ;-)

Yes you did! And I disassembled almost every one of these functions during
my investigation, but in my infinite wisdom I did not save any of them. So,
now I need to do that again to answer Vlastimil's question. I'll try to do
that today.

> >
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > >
> > > Kinda sad that despite the static key we have to control a lot by
> > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT in addition.
> >
> > I agree. If there is a better way to fix this regression I'm open to
> > changes. Let's wait for Steven to confirm my understanding before
> > proceeding.
>
> How slow is it to always do the call instead of inlining?

Let's see... The additional overhead if we always call is:

Little core:  2.42%
Middle core:  1.23%
Big core:     0.66%

Not a huge deal because the overhead of memory profiling when enabled is
much higher.
So, maybe for simplicity I should indeed always call?

>
> -- Steve
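
(For illustration only, here is a minimal sketch of the pattern being
discussed: the work guarded by a normally-disabled static key is moved into
a separate noinline function, so the hot path carries only the nop/jump
patched by the static key and the cold body stays out of the instruction
cache. This is not the actual alloc_tag code; mem_profiling_key, alloc_hook,
alloc_hook_slowpath and profiled_bytes are made-up names for the example.)

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/jump_label.h>
#include <linux/types.h>

static DEFINE_STATIC_KEY_FALSE(mem_profiling_key);

static atomic_long_t profiled_bytes = ATOMIC_LONG_INIT(0);

/* Cold path: kept out of line so callers only emit a call instruction. */
static noinline void alloc_hook_slowpath(size_t size)
{
	/* stand-in for the bulky accounting/profiling work */
	atomic_long_add(size, &profiled_bytes);
}

/* Hot path: a single patched-out branch when profiling is disabled. */
static __always_inline void alloc_hook(size_t size)
{
	if (static_branch_unlikely(&mem_profiling_key))
		alloc_hook_slowpath(size);
}

The point of the function call, as opposed to inlining the guarded body at
every call site, is that the slow-path text lives away from the hot
allocation path, which is the icache-pollution effect discussed above.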