On Mon, Jan 30, 2023 at 9:47 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
>
> I should add, do you have any particular test script you use? If not,
> that's fine, I just cobble something together, but I figured if you
> had something already it would save me having to remember the details
> on the perf tools.

So I've done various things over the years, including just writing
special tools that do nothing but a recursive 'stat()' over and over
again over a big tree, just to pinpoint the path lookup costs (big
enough of a tree that you see actual cache effects). Then do that
either single-threaded or multi-threaded to see the locking issues.
(A rough sketch of that kind of walker is appended below the sign-off.)

But what I keep coming back to is to just have a fully built "make
allmodconfig" tree - which I have _anyway_ - and then doing

    perf record -e cycles:pp make -j64

on it.

You'll need to do something like

    echo 1 > /proc/sys/kernel/perf_event_paranoid

before starting your profiling session to make it possible to do that
profile as a normal user and get the kernel data.

And then look at the end result with just

    perf report --sort=dso,symbol

which avoids sorting by process, because I don't care _which_ process
does something, I just want to see the kernel symbol table end
results. Press 'k' to zoom into just the kernel profile, and Bob's
your uncle.

You can play with callchain data ("-g"), but I tend to like the plain
flat profile to just see where things are happening. I'll do the call
chain if I then start to look into things like "which caller was the
main reason for that queued_spin_lock_slowpath cost" kinds of things,
but it's not always even necessary.

Unless you have some big kernel debugging options on, what you
normally see for that load would be

 - memset and memcpy (including very much our user-space versions of
   it, like clear_page_rep and copy_user_generic_string)

 - depending on number of CPU cores, locking (I *despise*
   folio_memcg_lock, but that's not from the pathname lookup, it's
   the page fault path, particularly WP faults)

 - page table setup and clearing

 - ... and finally, pathname walking, generally with
   selinux_inode_permission and avc_has_perm_*() fairly high up

So it's not like 'make' is dominated by pathname walking - the
process related stuff tends to be higher - but I've ended up using
that as a kernel profile source because it's a real load for me.

Also, most of the time by far is spent in 'make' doing various string
things in user space. Our kernel makefiles tend to have a lot of
symbol expansion etc. I just ignore all the user space stuff.

There are other loads I occasionally look at, but this is basically
the one I always tend to return to because it tends to stress the two
things I personally end up interested in - the VFS layer and the VM
code. I don't really tend to do IO etc.

Put another way: there's nothing _special_ about the above, except
for the "it's a real load that does actually show a few core kernel
areas".

Also, the above is just about the least fancy use of perf you'll ever
see. No events, no special hardware counters for things like cache
misses or anything, just plain old "where does the time go".

I do end up looking at the annotated assembly code (press 'a' on the
selected symbol), but it's worth noting that even with hardware
profiling (Intel: PEBS, AMD: IBS), saying "exactly where did we spend
time" is a pretty ambiguous thing on modern OoO cores - you have to
interpret the data by just seeing lots of it.
But usually you can see "hot loop here", "mispredict there" or "that
load is taking cache misses", so the instruction-level profiles do
need to be taken with a huge grain of salt and some experience with
that microarchitecture to really make sense of them.

              Linus
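
[ For reference: a minimal sketch of the kind of recursive stat()
  walker described at the top of the mail. It is an illustration, not
  the actual tool - the tree root and the repeat count are
  placeholders, and a multi-threaded variant would simply run several
  of these walks in parallel to expose the locking costs. ]

/*
 * Minimal recursive stat() micro-benchmark sketch: walk a big tree
 * over and over, doing nothing but lstat(), so the profile ends up
 * dominated by path lookup rather than IO or page cache fills.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static void walk(const char *dir)
{
	char path[4096];	/* plenty for a kernel-tree-depth walk */
	struct stat st;
	struct dirent *de;
	DIR *d = opendir(dir);

	if (!d)
		return;
	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (lstat(path, &st))
			continue;
		/* Recurse into subdirectories so every path gets looked up */
		if (S_ISDIR(st.st_mode))
			walk(path);
	}
	closedir(d);
}

int main(void)
{
	/* Placeholder: point this at a fully built kernel tree or similar,
	 * and repeat enough times that the cold-cache first pass is noise. */
	for (int i = 0; i < 100; i++)
		walk(".");
	return 0;
}

[ Build with something like "gcc -O2 -o statwalk statwalk.c", cd into
  a big already-cached tree, and run it under the same "perf record"
  invocation shown above. ]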