Hi Daniel,

On Mon, Jan 16, 2023 at 08:30:21PM -0000, Daniel Colascione wrote:
> As mentioned in [1], instead of finding a way to have the kernel
> unwind user programs, we can create a protocol through which the
> kernel can ask usermode to unwind itself.

I like this idea and have been thinking along similar lines. It would
be great if we could prototype it with existing mechanisms, to show
that it is feasible and fast enough, and to cover existing linux/glibc
installs that don't yet have the new interfaces.

> It could work like this:
>
> 1) backtrace requested in the kernel (e.g. due to a perf counter
> overflow)
>
> 2) kernel unwinds itself to the userspace boundary the usual way
>
> 3) kernel forms a nonce (e.g. by incrementing a 64-bit counter)
>
> 4) kernel logs a stack trace the usual way (e.g. to the ftrace ring
> buffer), but with the final frame referring to the nonce created in
> the previous step
>
> 5) kernel queues a signal (one userspace has explicitly opted into
> via a new prctl()); the siginfo_t structure encodes (e.g. via
> si_status and si_value) the nonce

So before it does this prctl() the process needs to set up all the
data structures it needs to safely handle the unwinding during signal
handling. That does mean that early process setup won't be able to be
profiled with user backtraces.

> 6) kernel eventually returns to userspace; queued signal handler
> gains control

So at this point, if the event triggered while that userspace thread
was running, the event logged by the kernel is basically just that
nonce?

> 7) signal handler unwinds the calling thread however it wants (and
> can sleep and take page faults if needed)

So in theory this can do anything that is async-signal-safe. But what
if it takes too long and another event gets triggered? Or it does a
syscall that produces an event? Another signal arrives? Or it causes
a SEGV?

> 8) signal handler logs the result of its unwind, along with the
> nonce, to the system log (e.g.
> via a new system call, a sysfs write, an io_uring submission, etc.)

A (shared) memory region seems simplest: whatever has been put into it
when the signal handler returns is the userspace backtrace
contribution. Maybe the prctl() call can set this up? Or maybe the
kernel can provide it through one of the siginfo_t fields?

> Post-processing tools can associate kernel stacks with user stacks
> tagged with the corresponding nonces and reconstitute the full
> stacks in effect at the time of each logged event.
>
> We can avoid duplicating unwinds too: at step #3, if the kernel
> finds that the current thread already has an unwind pending, it can
> use the already-pending nonce instead of making a new one and
> queuing a signal: many kernel stacks can end with the same user
> stack "tail".

This is probably a generic optimization for most backtraces; most will
have a similar tail.

> One nice property of this scheme is that the userspace unwinding
> isn't limited to native code. Libc could arbitrate unwinding across
> an arbitrary number of managed runtime environments in the context
> of a single process: the system could be smart enough to know that
> instead of unwinding through, e.g., Python interpreter frames, the
> unwinder (which is normal userspace code, pluggable via DSO!) could
> traverse and log *Python* stack frames instead, with meaningful
> function names. And if you happened to have, say, a JavaScript
> runtime in the same process, both JavaScript and Python could
> participate in the semantic unwinding process.
>
> A pluggable userspace unwind mechanism would have zero cost in the
> case that we're not recording stack frames. On top of that, a
> pluggable userspace unwinder *could* be written to traverse frame
> pointers just as the kernel unwinder does today, if userspace thinks
> that's the best option. Without breaking kernel ABI, that userspace
> unwinder could use DWARF, or ORC, or any other userspace unwinding
> approach. It's future-proof.
This is nice, but it does need some coordination for handing off the
unwinding context between the different unwinders. You could register
an unwinder by memory region (if the address is in this range, it is
in the lua interpreter), and then start generating a backtrace using
the dedicated unwinder for that memory region. That original unwinder
can start with the full ucontext_t. But what is the contract between
unwinders? If we start with e.g. a frame-pointer unwinder, then by the
time we get to a part that needs the python unwinder we only have a
pc, sp and fp left. Is that enough context for the python unwinder to
continue?

What would we need to prototype this idea and show that we can produce
quick backtraces using fast eh_frame unwinding, before we convince the
kernel to provide this interface and have it in glibc (or the vdso;
without any fancy caching, it might fit in the vdso)? We could use
LD_PRELOAD, or some ptrace parasite code like criu uses, to insert the
unwinder code and trigger registration; use an itimer and SIGPROF as
the signal; and use shared memory as a ring buffer to provide the
backtrace addresses (and possibly other context - build-id mappings?)
for a profiling app/library to read events from. Without explicit
kernel support that might not feel like system-wide profiling, but it
should give us a feel for how well it would work.

Or are there other holes/missing functionality?
Cheers,

Mark

> [1] https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/message/646XXHGEGOKO465LQKWCPPPAZBSW5NWO/