On Tue, Jan 28, 2025 at 11:35 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Mon, 27 Jan 2025 11:38:32 -0800
> Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> > On Sun, Jan 26, 2025 at 8:47 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> > >
> > > On 1/26/25 08:02, Suren Baghdasaryan wrote:
> > > > When a sizable code section is protected by a disabled static key, that
> > > > code gets into the instruction cache even though it's not executed and
> > > > consumes the cache, increasing cache misses. This can be remedied by
> > > > moving such code into a separate uninlined function. The improvement
> >
> > Sorry, I missed adding Steven Rostedt into the CC list since his
> > advice was instrumental in finding the way to optimize the static key
> > performance in this patch. Added now.
> >
> > >
> > > Weird, I thought the static_branch_likely/unlikely/maybe was already
> > > handling this by the unlikely case being a jump to a block away from the
> > > fast-path stream of instructions, thus making it less likely to get cached.
> > > AFAIU even plain likely()/unlikely() should do this, along with branch
> > > prediction hints.
> >
> > This was indeed an unexpected overhead when I measured it on Android.
> > Cache pollution was my understanding of the cause of this high
> > overhead after Steven told me to try uninlining the protected code. He
> > has done something similar in the tracing subsystem. But maybe I
> > misunderstood the real reason. Steven, could you please verify whether my
> > understanding of the cause of the high overhead is correct here? Maybe
> > there is something else at play that I missed?
>
> From what I understand, the compiler will only move code to the end
> of a function with unlikely(). But the code after the function could
> also be in the control-flow path. If you have several functions that are
> called together, adding code to the unlikely() cases may not help the
> speed.
>
> I made an effort to make the tracepoint code call functions instead of
> having everything inlined. It actually brought down the size of the text of
> the kernel, but looking in the change logs I never posted benchmarks. But
> I'm sure making the size of the scheduler text section smaller probably did
> help.
>
> > > That would be in line with my understanding above. Does the arm64 compiler
> > > not do it as well as x86 (which could maybe be found out by disassembling),
> > > or does the Pixel 6 CPU somehow cache these out-of-line blocks more
> > > aggressively, so that only a function call stops it?
> >
> > I'll disassemble the code and will see what it looks like.
>
> I think I asked you to do that too ;-)

Yes you did! And I disassembled almost every one of these functions during
my investigation, but in my infinite wisdom I did not save any of them. So,
now I need to do that again to answer Vlastimil's question. I'll try to do
that today.

> >
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > >
> > > Kinda sad that despite the static key we have to control a lot by
> > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT in addition.
> >
> > I agree. If there is a better way to fix this regression I'm open to
> > changes. Let's wait for Steven to confirm my understanding before
> > proceeding.
>
> How slow is it to always do the call instead of inlining?

Let's see... The additional overhead if we always call is:

Little core:  2.42%
Middle core:  1.23%
Big core:     0.66%

Not a huge deal because the overhead of memory profiling when enabled is
much higher.
So, maybe for simplicity I should indeed always call?

>
> -- Steve
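
(For illustration only, here is a minimal sketch of the pattern being
discussed: the work guarded by a normally-disabled static key is moved into
a separate noinline function, so the hot path carries only the nop/jump
patched by the static key and the cold body stays out of the instruction
cache. This is not the actual alloc_tag code; mem_profiling_key, alloc_hook,
alloc_hook_slowpath and profiled_bytes are made-up names for the example.)

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/jump_label.h>
#include <linux/types.h>

static DEFINE_STATIC_KEY_FALSE(mem_profiling_key);

static atomic_long_t profiled_bytes = ATOMIC_LONG_INIT(0);

/* Cold path: kept out of line so callers only emit a call instruction. */
static noinline void alloc_hook_slowpath(size_t size)
{
	/* stand-in for the bulky accounting/profiling work */
	atomic_long_add(size, &profiled_bytes);
}

/* Hot path: a single patched-out branch when profiling is disabled. */
static __always_inline void alloc_hook(size_t size)
{
	if (static_branch_unlikely(&mem_profiling_key))
		alloc_hook_slowpath(size);
}

The point of the function call, as opposed to inlining the guarded body at
every call site, is that the slow-path text lives away from the hot
allocation path, which is the icache-pollution effect discussed above.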