Re: Yet another unwinding approach

Demi Marie Obenour <demiobenour@xxxxxxxxx> · Tue, 17 Jan 2023 20:09:18 -0500

On 1/17/23 19:21, Mark Wielaard wrote:
> Hi Daniel,
> 
> On Mon, Jan 16, 2023 at 08:30:21PM -0000, Daniel Colascione wrote:
>> As mentioned in [1], instead of finding a way to have the kernel
>> unwind user programs, we can create a protocol through which the
>> kernel can ask usermode to unwind itself.
> 
> I like this idea and was discussing and thinking along similar lines.
> It would be great if we can implement this using existing mechanisms
> to prototype something to show it is feasible and fast enough.  And
> for existing linux/glibc installs that don't yet have the new
> interfaces.
> 
>> It could work like this:
>>
>> 1) backtrace requested in the kernel (e.g. in response to a perf
>> counter overflow)
>>
>> 2) kernel unwinds itself to the userspace boundary the usual way
>>
>> 3) kernel forms a nonce (e.g. by incrementing a 64-bit counter)
>>
>> 4) kernel logs a stack trace the usual way (e.g. to the ftrace ring
>> buffer), but with the final frame referring to the nonce created in
>> the previous step
>>
>> 5) kernel queues a signal (one userspace has explicitly opted into
>> via a new prctl()); the siginfo_t structure encodes (e.g. via
>> si_status and si_value) the nonce
> 
> So before it does this prctl the process needs to set up all the
> data structures it needs to safely handle the unwinding during signal
> handling. That does mean that early process setup can't be profiled
> with user backtraces.

There could be an unwinder in the vDSO that is used if userspace has
not registered a replacement one.

>> 6) kernel eventually returns to userspace; queued signal handler
>> gains control
> 
> So at this point, if the event triggered while that user space thread
> was running, the event logged by the kernel is basically just that
> nonce?
> 
>> 7) signal handler unwinds the calling thread however it wants (and
>> can sleep and take page faults if needed)
> 
> So in theory this can do anything that is async signal safe. But what
> if it takes too long and another event gets triggered? Or it does a
> syscall that produces an event? Another signal arrives? Or if it
> causes a SEGV?
> 
>> 8) signal handler logs the result of its unwind, along with the
>> nonce, to the system log (e.g. via a new system call, a sysfs write,
>> an io_uring submission, etc.)
> 
> A (shared) memory region seems simplest, whatever has been put into it
> when the signal handler returns is the user space backtrace
> contribution. Maybe the prctl call can set this up? Or maybe the
> kernel can provide it through one of the siginfo_t fields?

siginfo_t seems simplest, and also works with a vDSO unwinder.

>> Post-processing tools can associate kernel stacks with user stacks
>> tagged with the corresponding nonces and reconstitute the full
>> stacks in effect at the time of each logged event.
>>
>> We can avoid duplicating unwinds too: at step #3, if the kernel
>> finds that the current thread already has an unwind pending, it can
>> use the already-pending nonce instead of making a new one and
>> queuing a signal: many kernel stacks can end with the same user
>> stack "tail".
> 
> This is probably a generic optimization for most backtraces, most will
> have a similar tail.
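
To make the nonce bookkeeping concrete, here is a toy model of steps 3-8
plus the post-processing join. It is only a sketch: `ToyKernel`,
`record_event`, `return_to_user` and `stitch` are hypothetical names, the
"signal" is just a dictionary entry, and nothing here corresponds to an
existing kernel interface.

```python
import itertools


class ToyKernel:
    """Toy model: at most one pending user unwind (nonce) per thread."""

    def __init__(self):
        self._counter = itertools.count(1)  # stands in for the 64-bit counter
        self.pending = {}                   # tid -> nonce awaiting a user unwind
        self.log = []                       # (kind, nonce, frames) records

    def record_event(self, tid, kernel_frames):
        # Step 3 with the dedup rule: reuse the nonce if an unwind is pending.
        nonce = self.pending.get(tid)
        if nonce is None:
            nonce = next(self._counter)
            self.pending[tid] = nonce       # step 5: "queue the signal"
        # Step 4: log the kernel stack, ending in a reference to the nonce.
        self.log.append(("kernel", nonce, list(kernel_frames)))
        return nonce

    def return_to_user(self, tid, unwinder):
        # Steps 6-8: the handler unwinds once and logs the tail with the nonce.
        nonce = self.pending.pop(tid, None)
        if nonce is not None:
            self.log.append(("user", nonce, list(unwinder())))


def stitch(log):
    """Post-processing: append the user tail to each kernel stack by nonce."""
    tails = {n: frames for kind, n, frames in log if kind == "user"}
    return [frames + tails.get(n, [])
            for kind, n, frames in log if kind == "kernel"]
```

Two events recorded before the thread returns to userspace share one
nonce, so the user stack is unwound once and joined to both kernel stacks.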
>  
>> One nice property of this scheme is that the userspace unwinding
>> isn't limited to native code. Libc could arbitrate unwinding across
>> an arbitrary number of managed runtime environments in the context
>> of a single process: the system could be smart enough to know that
>> instead of unwinding through, e.g. Python interpreter frames, the
>> unwinder (which is normal userspace code, pluggable via DSO!) could
>> traverse and log *Python* stack frames instead, with meaningful
>> function names. And if you happened to have, say, a JavaScript
>> runtime in the same process, both JavaScript and Python could
>> participate in the semantic unwinding process.
>>
>> A pluggable userspace unwind mechanism would have zero cost in the
>> case that we're not recording stack frames. On top of that, a
>> pluggable userspace unwinder *could* be written to traverse frame
>> pointers just as the kernel unwinder does today, if userspace thinks
>> that's the best option. Without breaking kernel ABI, that userspace
>> unwinder could use DWARF, or ORC, or any other userspace unwinding
>> approach. It's future-proof.
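
For reference, the frame-pointer walk mentioned above is small enough to
sketch. This is a toy version over simulated memory, assuming the usual
x86-64 layout (saved frame pointer at [fp], return address at [fp + 8]);
`walk_frame_pointers` and the `memory` dictionary are made up for
illustration and do not read a real stack.

```python
def walk_frame_pointers(memory, pc, fp, word=8, max_frames=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    memory maps an address to the word stored there (a stand-in for
    reading the thread's stack).
    """
    frames = [pc]
    while fp and len(frames) < max_frames:
        saved_fp = memory.get(fp)
        ret = memory.get(fp + word)
        if saved_fp is None or ret is None:
            break                      # unreadable stack: stop quietly
        frames.append(ret)
        if saved_fp != 0 and saved_fp <= fp:
            break                      # chain must move toward higher addresses
        fp = saved_fp                  # 0 terminates the loop
    return frames
```

A real unwinder would also have to validate that each address lies within
the thread's stack before dereferencing it.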
> 
> This is nice, but does need some coordination for handing off the
> unwinding context between different unwinders. You could register an
> unwinder by memory region (if address is in this range, it is in the
> lua interpreter). And then you can start generating a backtrace using
> the dedicated unwinder for that memory region. That original unwinder
> can start with the full ucontext_t. But what is the contract between
> unwinders? If we start with e.g. a frame-pointer unwinder then when we
> get to a part that needs to use the python unwinder we only have a pc,
> sp and fp left. Is that enough context for the python unwinder to
> continue?
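
One possible shape for that contract, as a sketch only: each unwinder
takes a context holding pc, sp and fp and returns a label for the current
frame plus the context of the caller (or None at the outermost frame), and
a registry picks the unwinder by address range. `UnwinderRegistry` and the
dictionary-based context are hypothetical, and whether pc/sp/fp is enough
for a given runtime is exactly the open question above.

```python
import bisect


class UnwinderRegistry:
    """Dispatch each unwind step to the unwinder owning the current pc."""

    def __init__(self, default):
        self._starts = []    # sorted region start addresses
        self._regions = []   # parallel list of (start, end, unwinder)
        self._default = default

    def register(self, start, end, unwinder):
        # Assumes registered regions do not overlap.
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._regions.insert(i, (start, end, unwinder))

    def lookup(self, pc):
        i = bisect.bisect_right(self._starts, pc) - 1
        if i >= 0:
            start, end, unwinder = self._regions[i]
            if start <= pc < end:
                return unwinder
        return self._default

    def backtrace(self, ctx, max_frames=64):
        frames = []
        while ctx is not None and len(frames) < max_frames:
            step = self.lookup(ctx["pc"])
            label, ctx = step(ctx)   # unwinder returns (label, caller ctx)
            frames.append(label)
        return frames
```

The handoff happens implicitly: whenever a step produces a pc inside a
registered region, the next step goes to that region's unwinder.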
> 
> What would we need to prototype this idea and show that we can produce
> quick backtraces using fast eh_frame unwinding before we convince the
> kernel to provide this interface and have it in glibc (or in the vDSO;
> without any fancy caching, it might fit there)?
> 
> We could use LD_PRELOAD or some ptrace parasite code like criu uses to
> insert the unwinder code and trigger registration, use an itimer with
> SIGPROF as the signal, and use shared memory as a ring buffer to provide
> the backtrace addresses (and possibly other context - build-id
> mappings?) for a profiling app/library to read events from. Without
> explicit kernel support that might not feel like system wide
> profiling, but should give us a feel of how well it would work. Or are
> there other holes/missing functionality?
> 
> Cheers,
> 
> Mark
>  
>> [1] https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/message/646XXHGEGOKO465LQKWCPPPAZBSW5NWO/ 
> 
> _______________________________________________
> devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)