Re: F38 proposal: Add _FORTIFY_SOURCE=3 to distribution build flags (System-Wide Change proposal)

"Andrii Nakryiko" <andrii.nakryiko@xxxxxxxxx> · Tue, 06 Dec 2022 17:46:11 -0000

> On Tue, Dec 06, 2022 at 08:13:51AM -0500, Neal Gompa wrote:
> 
> That is nonsense.  Even with -fno-omit-frame-pointer, you can't rely
> on frame pointers, they are not accurate in function prologues and epilogues
> and they are total garbage e.g. in a lot of functions written in assembly.

First of all, https://pagure.io/fesco/issue/2817 is first and foremost about enabling low-overhead **profiling** of applications (periodic in the background, or on demand when users explicitly request it), not about debugging use cases. For debugging, GDB may be perfectly adequate relying only on DWARF information. But -fno-omit-frame-pointer is meant to enable profiling of **production workloads** as they happen, because very often you can't reproduce an issue without understanding it in the first place, and getting that understanding is what frame pointers are needed for. Even more often, you don't even realize that an application is doing something suboptimal unless you profile it continuously, while it handles a *real workload*.

Now, about prologues/epilogues. What percentage of a useful workload is spent in those? A tiny fraction of a percent at most? Even if we don't get an accurate stack trace in such cases, it doesn't matter in the grand scheme of things.

As for hand-written assembly functions: I looked at strcmp, memcpy and similar very frequently called, hot functions in glibc. Yes, they don't save %rbp and don't maintain frame pointers. But they also don't use the %rbp register at all, which means **they don't clobber it**. So if we take a stack trace while such a function is executing, we'll still get a non-broken stack trace all the way up to the root parent function; we just won't have the leaf-level function in the stack trace. That's much better and more useful than a completely broken stack, and it allows us to reason very well about application performance. Also, almost all FPU-related routines under sysdeps/x86_64/fpu/multiarch/svml*.S that use %rbp also properly maintain frame pointers.
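To make the "don't clobber it" point concrete, here is a toy model of a frame-pointer walk. It is only a sketch, not glibc or kernel code: memory is a plain dict, the addresses and the function names (main, parse, lookup) are made up, and return addresses are represented by the name of the function they return into. The walker only follows the saved-%rbp chain; how the sampled instruction pointer itself gets attributed is outside this sketch.

```python
# Toy model of x86-64 frame-pointer unwinding.
# Assumed frame layout: memory[rbp] holds the caller's saved %rbp and
# memory[rbp + 8] holds the return address into the caller.
# All addresses and function names below are hypothetical.

def walk_frame_pointers(memory, rbp, max_depth=64):
    """Follow the saved-%rbp chain and collect return addresses."""
    trace = []
    while rbp in memory and rbp + 8 in memory and len(trace) < max_depth:
        trace.append(memory[rbp + 8])
        next_rbp = memory[rbp]
        if next_rbp <= rbp:  # the stack grows down, so a sane chain moves up
            break
        rbp = next_rbp
    return trace

# Call chain: _start -> main -> parse -> lookup -> leaf routine
memory = {
    0x7000: 0,      0x7008: "_start",  # main's frame (end of chain)
    0x6000: 0x7000, 0x6008: "main",    # parse's frame
    0x5000: 0x6000, 0x5008: "parse",   # lookup's frame
    0x4000: 0x5000, 0x4008: "lookup",  # leaf's frame, if it sets one up
}

# (a) The leaf maintains a frame pointer: %rbp points at its own frame.
with_frame = walk_frame_pointers(memory, 0x4000)

# (b) The leaf never touches %rbp: it still points at lookup's frame.
untouched = walk_frame_pointers(memory, 0x5000)

# (c) The leaf uses %rbp as a scratch register: the chain is lost.
clobbered = walk_frame_pointers(memory, 0xDEADBEEF)

print(with_frame)
print(untouched)
print(clobbered)
```

In this model, case (b) comes up one entry short but still reaches the root, while case (c) yields nothing usable. That is the difference between a routine that ignores %rbp and one that clobbers it.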

There is a very good reason why Meta enabled -fno-omit-frame-pointer across all its internal software 5 years ago and never looked back. We rely on frame pointers being available across millions of machines to drive performance improvements and investigations that yield millions of dollars in real, non-hypothetical savings. Google, Netflix, Apple, etc. all enable frame pointers because of their sheer usefulness in practice, for performance profiling and/or better real-time introspection of their workloads.

So much for the "nonsense". I realize that not everyone has experience with tracing production workloads or with working in the profiling area in general, but I expect people to keep an open mind and not apply double standards when thinking about the implications of system-wide changes like this one or the frame pointers one.

> The only reliable way to get backtraces is DWARF info or something derived
> from it, that is for code emitted by the compilers (with the default
> -fasynchronous-unwind-tables) accurate for all instructions and for
> hand written assembly one has at least a way to describe that through
> .cfi_* directives.
> 
> As has been written multiple times, for profiling there doesn't need to be
> full DWARF unwinder in the kernel, it is enough that there is something
> that can handle the vast majority of cases and punt (copy to userland full
> stack) in case of anything unexpected as long as it is rare.
> Say on x86-64 and various other targets, the stack pointer, CFA (how to get
> caller's stack pointer), frame pointer if any or return address is very
> rarely stored anywhere but on the stack, so it should be enough to consider
> CFA, sp, bp, ip during unwinding and ignore everything else and from the
> harder DW_CFA_* ops (those that need DWARF expression evaluation)
> perhaps only pattern recognize the most common ones (say PLT slot, signal
> frame).

You clearly have never tried to do this in practice. If you had, you'd know the price of discovering and pre-computing such lookup tables in terms of memory usage, CPU usage, complexity, and reliability. And then you'd also realize the problem of tracing short-lived processes that started after profiling began and so weren't discovered upfront. As for the approach of capturing the stack and sending it for post-processing: beyond just the overhead of capturing and sending so much data, you'd also stop and wonder what the right size of stack to capture is, in order to handle deep stacks/recursion or functions with large stack usage.
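A rough back-of-the-envelope sketch of the copy-the-stack cost, to show the shape of the trade-off. Every number here is an illustrative assumption, not a measurement: an 8 KiB dump per sample, a 99 Hz sampling rate, and 64 busy CPUs. The helper names are made up for this example.

```python
# Back-of-the-envelope cost of copying a fixed-size stack dump per sample.
# All inputs are hypothetical; plug in your own numbers.

def stack_copy_bytes_per_second(dump_bytes, sample_hz, cpus):
    """Raw stack bytes copied per second if every CPU is sampled."""
    return dump_bytes * sample_hz * cpus

def is_truncated(stack_in_use_bytes, dump_bytes):
    """True if the live stack is deeper than the dump, so the unwind stops early."""
    return stack_in_use_bytes > dump_bytes

rate = stack_copy_bytes_per_second(dump_bytes=8192, sample_hz=99, cpus=64)
print(f"{rate} bytes/s = {rate / 2**20} MiB/s")

# A deeply recursive thread using 100 KB of stack does not fit in 8 KiB,
# while a shallow one using 4 KiB does.
print(is_truncated(100_000, 8192), is_truncated(4096, 8192))
```

Raising the dump size to cover deep stacks scales the copy volume linearly, and lowering it truncates exactly the traces you most want to see. A frame-pointer walk only records one return address per frame.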

DWARF is not a panacea. For production workloads it doesn't even come close to being a satisfactory solution if employed in isolation.

> 
> Frame pointers will never result in anything reliable though, it results
> purely in severe performance degradation and false feeling they can be
> relied on.

THIS is nonsense, on both the degradation and the reliability points. Performance engineers and typical application owners alike would laugh at a statement this detached from reality.

> 
> 	Jakub