On Sat, Oct 21, 2023 at 3:06 PM Richard W.M. Jones <rjones@xxxxxxxxxx> wrote:
> I was asked about the topic in the subject, and I think it's not very
> well known. The news is that since Fedora 38, whole-system
> performance analysis is now easy to do. This can be used to identify
> hot spots in single applications, or to see what the whole computer
> is really doing during lengthy operations.
>
> You can visualise these in various ways - my favourite is Brendan
> Gregg's Flame Graphs tools, but perf has many alternate ways to
> capture and display the data:
>
> https://www.brendangregg.com/linuxperf.html
> https://www.brendangregg.com/flamegraphs.html
> https://perf.wiki.kernel.org/index.php/Tutorial
>
> I did a 15 min talk on this topic, actually to an internal Red Hat
> audience, but I guess it's fine to open it up to everyone:
>
> http://oirase.annexia.org/tmp/2023-03-08-flamegraphs.mp4 [57M, 15m41s]

Hello Richard,

Thank you for posting this. In the talk you mentioned that the
"--off-cpu" option was not yet available. Has there been any progress
on enabling it since the talk was recorded?

I have just tried it in Rawhide, and perf is still built without it:

    Warning: option `off-cpu' is being ignored because no BUILD_BPF_SKEL=1

What is blocking the enablement of this feature? Are there trade-offs?
Is there a thread or a Bugzilla ticket where it is discussed? (A
sketch of how one might build perf with the option enabled follows at
the end of this mail.)

Michal

> To show the kind of thing which is possible I have captured three
> whole-system flame graphs. The first comes from doing "make -j32" in
> the qemu build tree:
>
> http://oirase.annexia.org/tmp/2023-gcc-with-lto.svg
>
> 8% of the time is spent running the assembler. I seem to recall that
> Clang uses a different approach, integrating the assembler into the
> compiler, and I guess it probably avoids most of that overhead.
>
> The second is an rpmbuild of the Fedora Rawhide kernel package:
>
> http://oirase.annexia.org/tmp/2023-kernel-build.svg
>
> I think it's interesting that 6% of the time is spent compressing the
> RPMs, and another 6% running pahole (debuginfo generation?). But the
> most surprising thing is that it appears 42% of the time is spent
> just parsing C code [if I'm reading that right; I actually can't
> believe parsing takes so much time]. If true, there must be
> opportunities to optimize things here.
>
> Captures work across userspace and kernel code, as shown in the
> third example, which is a KVM (i.e. hardware-assisted) virtual
> machine doing some highly parallel work inside:
>
> http://oirase.annexia.org/tmp/2023-kvm-build.svg
>
> You can clearly see the 8 virtual (guest) CPUs on the left side,
> using KVM. More interesting is that this guest uses a qcow2 file for
> its disk, and there is a heck of an overhead writing to that file.
> There's nothing to fix here -- qcow2 files shouldn't be used in this
> situation; for best performance it would be better to use a local
> block device to back the guest.
>
> The overhead of frame pointers in my measurements is about 1%, so
> this enhanced visibility into the system seems well worthwhile. I
> use this all the time. This year I've used it to suggest
> optimizations in qemu, nbdkit and the kernel.
>
> Rich.
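
P.S. For anyone who wants to reproduce the kind of whole-system
capture Rich describes, here is a minimal sketch using perf together
with Brendan Gregg's FlameGraph scripts. stackcollapse-perf.pl and
flamegraph.pl come from his FlameGraph repository (linked from the
pages above) and are assumed to be on PATH here; the 99 Hz sample
rate and 60-second window are just illustrative choices, not anything
prescribed in the talk:

    # Sample all CPUs (-a) with call graphs (-g) at 99 Hz for 60 s.
    # On Fedora 38+ frame pointers are built in distribution-wide,
    # so plain -g works; older builds may need --call-graph dwarf.
    perf record -a -g -F 99 -- sleep 60

    # Fold the recorded stacks and render the interactive SVG.
    perf script | stackcollapse-perf.pl | flamegraph.pl > whole-system.svg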
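
And, for completeness, what I understand enabling "--off-cpu" would
take: the feature requires perf to be built with its BPF skeletons,
i.e. with BUILD_BPF_SKEL=1 -- the very knob the Rawhide warning
names. A sketch of a local build from a kernel source tree, assuming
clang, the libbpf headers and the usual perf build dependencies are
installed (details may vary between kernel versions):

    # Build perf with BPF skeleton support from a kernel source tree.
    make -C tools/perf BUILD_BPF_SKEL=1

    # With such a build, off-CPU time can be recorded alongside the
    # normal on-CPU samples, e.g. system-wide with call graphs
    # (needs the privileges required to load BPF programs):
    ./tools/perf/perf record --off-cpu -a -g -- sleep 60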