On 1/17/23 19:21, Mark Wielaard wrote:
> Hi Daniel,
>
> On Mon, Jan 16, 2023 at 08:30:21PM -0000, Daniel Colascione wrote:
>> As mentioned in [1], instead of finding a way to have the kernel
>> unwind user programs, we can create a protocol through which the
>> kernel can ask usermode to unwind itself.
>
> I like this idea and was discussing and thinking along similar lines.
> It would be great if we can implement this using existing mechanisms
> to prototype something that shows it is feasible and fast enough,
> including on existing Linux/glibc installs that don't yet have the
> new interfaces.
>
>> It could work like this:
>>
>> 1) backtrace requested in the kernel (e.g. on a perf counter
>> overflow)
>>
>> 2) kernel unwinds itself to the userspace boundary the usual way
>>
>> 3) kernel forms a nonce (e.g. by incrementing a 64-bit counter)
>>
>> 4) kernel logs a stack trace the usual way (e.g. to the ftrace ring
>> buffer), but with the final frame referring to the nonce created in
>> the previous step
>>
>> 5) kernel queues a signal (one userspace has explicitly opted into
>> via a new prctl()); the siginfo_t structure encodes (e.g. via
>> si_status and si_value) the nonce
>
> So before it does this prctl the process needs to set up all the
> data structures it needs to safely handle the unwinding during signal
> handling. That does mean that early process setup won't be able to be
> profiled with user backtraces.

There could be an unwinder in the vDSO that is used if userspace has
not registered a replacement one.

>> 6) kernel eventually returns to userspace; queued signal handler
>> gains control
>
> So at this point, if the event triggered while that user space thread
> was running, the event logged by the kernel is basically just that
> nonce?
>
>> 7) signal handler unwinds the calling thread however it wants (and
>> can sleep and take page faults if needed)
>
> So in theory this can do anything that is async-signal-safe.
> But what if it takes too long and another event gets triggered? Or it
> does a syscall that produces an event? Another signal arrives? Or if
> it causes a SEGV?
>
>> 8) signal handler logs the result of its unwind, along with the
>> nonce, to the system log (e.g. via a new system call, a sysfs write,
>> an io_uring submission, etc.)
>
> A (shared) memory region seems simplest: whatever has been put into it
> when the signal handler returns is the user space backtrace
> contribution. Maybe the prctl call can set this up? Or maybe the
> kernel can provide it through one of the siginfo_t fields?

siginfo_t seems simplest, and also works with a vDSO unwinder.

>> Post-processing tools can associate kernel stacks with user stacks
>> tagged with the corresponding nonces and reconstitute the full
>> stacks in effect at the time of each logged event.
>>
>> We can avoid duplicating unwinds too: at step #3, if the kernel
>> finds that the current thread already has an unwind pending, it can
>> use the already-pending nonce instead of making a new one and
>> queuing a signal: many kernel stacks can end with the same user
>> stack "tail".
>
> This is probably a generic optimization for most backtraces, most will
> have a similar tail.
>
>> One nice property of this scheme is that the userspace unwinding
>> isn't limited to native code. Libc could arbitrate unwinding across
>> an arbitrary number of managed runtime environments in the context
>> of a single process: the system could be smart enough to know that
>> instead of unwinding through, e.g. Python interpreter frames, the
>> unwinder (which is normal userspace code, pluggable via DSO!) could
>> traverse and log *Python* stack frames instead, with meaningful
>> function names. And if you happened to have, say, a JavaScript
>> runtime in the same process, both JavaScript and Python could
>> participate in the semantic unwinding process.
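
Going back to the nonce reuse at step #3: here is a rough userspace
model of how I read it, in case it helps the discussion. This is not
kernel code. The names (thread_state, event_nonce, unwind_done) are made
up for illustration, and when exactly the "pending" flag gets cleared is
my assumption, since the proposal does not say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread bookkeeping for the scheme sketched above. */
struct thread_state {
    bool unwind_pending;     /* a signal for pending_nonce is already queued */
    uint64_t pending_nonce;  /* nonce that the queued signal will carry */
};

/* Step 3: global 64-bit counter. A real kernel would need an atomic here. */
static uint64_t next_nonce = 1;

/*
 * Called for each event that wants a user backtrace. Returns the nonce to
 * log as the final frame of the kernel stack (step 4). Sets *queue_signal
 * to true only when a new signal has to be queued (step 5).
 */
uint64_t event_nonce(struct thread_state *t, bool *queue_signal)
{
    assert(t != NULL && queue_signal != NULL);
    if (t->unwind_pending) {
        /* Same user stack "tail": reuse the nonce, queue nothing. */
        *queue_signal = false;
        return t->pending_nonce;
    }
    t->pending_nonce = next_nonce++;
    t->unwind_pending = true;
    *queue_signal = true;
    return t->pending_nonce;
}

/*
 * Assumption: the pending state is cleared once userspace has logged the
 * user stack for this nonce (step 8).
 */
void unwind_done(struct thread_state *t, uint64_t nonce)
{
    if (t->unwind_pending && t->pending_nonce == nonce)
        t->unwind_pending = false;
}
```

With this model, any number of events between the first one and the
return to userspace share one nonce and cost one signal.
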
>>
>> A pluggable userspace unwind mechanism would have zero cost in the
>> case that we're not recording stack frames. On top of that, a
>> pluggable userspace unwinder *could* be written to traverse frame
>> pointers just as the kernel unwinder does today, if userspace thinks
>> that's the best option. Without breaking kernel ABI, that userspace
>> unwinder could use DWARF, or ORC, or any other userspace unwinding
>> approach. It's future-proof.
>
> This is nice, but does need some coordination for handing off the
> unwinding context between different unwinders. You could register an
> unwinder by memory region (if address is in this range, it is in the
> Lua interpreter). And then you can start generating a backtrace using
> the dedicated unwinder for that memory region. That original unwinder
> can start with the full ucontext_t. But what is the contract between
> unwinders? If we start with e.g. a frame-pointer unwinder then when we
> get to a part that needs to use the Python unwinder we only have a pc,
> sp and fp left. Is that enough context for the Python unwinder to
> continue?
>
> What would we need to prototype this idea and show that we can produce
> quick backtraces using fast eh_frame unwinding before we convince the
> kernel to provide this interface and have it in glibc (or the vDSO;
> without any fancy caching, it might fit there)?
>
> We could use LD_PRELOAD or some ptrace parasite code like CRIU uses to
> insert the unwinder code and trigger registration, use an itimer and
> SIGPROF as signal and shared memory to use as ring buffer to provide
> the backtrace addresses (and possibly other context - build-id
> mappings?) for a profiling app/library to read events from. Without
> explicit kernel support that might not feel like system wide
> profiling, but should give us a feel of how well it would work. Or are
> there other holes/missing functionality?
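
For the prototype, something along these lines might be a starting
point. It is only a sketch under several assumptions: it uses glibc's
backtrace() as a stand-in for a dedicated fast eh_frame unwinder, it
keeps the ring buffer in a static array rather than real shared memory,
it assumes a single-threaded process, and the names (proto_install,
proto_start_timer, struct sample) are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <execinfo.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

#define MAX_DEPTH  32
#define RING_SLOTS 64

/* One backtrace record. A real prototype would place these in shared memory. */
struct sample {
    uint64_t seq;           /* sample number, stands in for the nonce */
    int depth;              /* number of valid entries in pc[] */
    void *pc[MAX_DEPTH];    /* return addresses, innermost first */
};

static struct sample ring[RING_SLOTS];
static volatile unsigned ring_head;   /* total samples written so far */

/* SIGPROF handler: unwind the interrupted thread into the next ring slot. */
static void on_sigprof(int signo, siginfo_t *info, void *uctx)
{
    (void)signo; (void)info; (void)uctx;
    struct sample *s = &ring[ring_head % RING_SLOTS];
    s->depth = backtrace(s->pc, MAX_DEPTH);
    s->seq = ring_head;
    ring_head++;
}

/* Install the handler. Returns 0 on success, -1 on error. */
int proto_install(void)
{
    /* Call backtrace() once up front so that any lazy loading of the
     * unwinder happens here and not inside the signal handler. */
    void *warm[4];
    backtrace(warm, 4);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPROF, &sa, NULL);
}

/* Arm ITIMER_PROF so SIGPROF fires every interval_usec of CPU time. */
int proto_start_timer(long interval_usec)
{
    assert(interval_usec > 0);
    struct itimerval it;
    it.it_interval.tv_sec = interval_usec / 1000000;
    it.it_interval.tv_usec = interval_usec % 1000000;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_PROF, &it, NULL);
}
```

An LD_PRELOAD library could call proto_install() and then
proto_start_timer(10000) from a constructor to sample roughly every
10 ms of CPU time. A reader process would then need the ring mapped as
shared memory, which this sketch leaves out.
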
>
> Cheers,
>
> Mark
>
>> [1] https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/message/646XXHGEGOKO465LQKWCPPPAZBSW5NWO/
>
> _______________________________________________
> devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)