On Mon, Jan 30, 2023 at 9:47 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
>
> I should add, do you have any particular test script you use? If not,
> that's fine, I just cobble something together, but I figured if you
> had something already it would save me having to remember the details
> on the perf tools.

So I've done various things over the years, including just writing
special tools that do nothing but a recursive 'stat()' over and over
again over a big tree, just to pinpoint the path lookup costs (big
enough of a tree that you see actual cache effects). Then do that
either single-threaded or multi-threaded to see the locking issues.
(A rough sketch of that kind of walker is appended below the sign-off.)

But what I keep coming back to is to just have a fully built "make
allmodconfig" tree - which I have _anyway_ - and then doing

    perf record -e cycles:pp make -j64

on it.

You'll need to do something like

    echo 1 > /proc/sys/kernel/perf_event_paranoid

before starting your profiling session to make it possible to do that
profile as a normal user and get the kernel data.

And then look at the end result with just

    perf report --sort=dso,symbol

which avoids sorting by process, because I don't care _which_ process
does something, I just want to see the kernel symbol table end
results. Press 'k' to zoom into just the kernel profile, and Bob's
your uncle.

You can play with callchain data ("-g"), but I tend to like the plain
flat profile to just see where things are happening. I'll do the call
chain if I then start to look into things like "which caller was the
main reason for that queued_spin_lock_slowpath cost" kinds of things,
but it's not always even necessary.

Unless you have some big kernel debugging options on, what you
normally see for that load would be

 - memset and memcpy (including very much our user-space versions of
   it, like clear_page_rep and copy_user_generic_string)

 - depending on number of CPU cores, locking (I *despise*
   folio_memcg_lock, but that's not from the pathname lookup, it's
   the page fault path, particularly WP faults)

 - page table setup and clearing

 - ... and finally, pathname walking, generally with
   selinux_inode_permission and avc_has_perm_*() fairly high up

So it's not like 'make' is dominated by pathname walking - the
process related stuff tends to be higher - but I've ended up using
that as a kernel profile source because it's a real load for me.

Also, most of the time by far is spent in 'make' doing various string
things in user space. Our kernel makefiles tend to have a lot of
symbol expansion etc. I just ignore all the user space stuff.

There are other loads I occasionally look at, but this is basically
the one I always tend to return to because it tends to stress the two
things I personally end up interested in - the VFS layer and the VM
code. I don't really tend to do IO etc.

Put another way: there's nothing _special_ about the above, except
for the "it's a real load that does actually show a few core kernel
areas".

Also, the above is just about the least fancy use of perf you'll ever
see. No events, no special hardware counters for things like cache
misses or anything, just plain old "where does the time go".

I do end up looking at the annotated assembly code (press 'a' on the
selected symbol), but it's worth noting that even with hardware
profiling (Intel: PEBS, AMD: IBS), saying "exactly where did we spend
time" is a pretty ambiguous thing on modern OoO cores - you have to
interpret the data by just seeing lots of it.
But usually you can see "hot loop here", "mispredict there" or "that
load is taking cache misses", so the instruction-level profiles do
need to be taken with a huge grain of salt and some experience with
that microarchitecture to really make sense of them.

              Linus
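
[ For reference: a minimal sketch of the kind of recursive stat()
  walker described at the top of the mail. It is an illustration, not
  the actual tool - the tree root and the repeat count are
  placeholders, and a multi-threaded variant would simply run several
  of these walks in parallel to expose the locking costs. ]

/*
 * Minimal recursive stat() micro-benchmark sketch: walk a big tree
 * over and over, doing nothing but lstat(), so the profile ends up
 * dominated by path lookup rather than IO or page cache fills.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static void walk(const char *dir)
{
	char path[4096];	/* plenty for a kernel-tree-depth walk */
	struct stat st;
	struct dirent *de;
	DIR *d = opendir(dir);

	if (!d)
		return;
	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (lstat(path, &st))
			continue;
		/* Recurse into subdirectories so every path gets looked up */
		if (S_ISDIR(st.st_mode))
			walk(path);
	}
	closedir(d);
}

int main(void)
{
	/* Placeholder: point this at a fully built kernel tree or similar,
	 * and repeat enough times that the cold-cache first pass is noise. */
	for (int i = 0; i < 100; i++)
		walk(".");
	return 0;
}

[ Build with something like "gcc -O2 -o statwalk statwalk.c", cd into
  a big already-cached tree, and run it under the same "perf record"
  invocation shown above. ]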