Hi Daniel,

On Mon, Jan 16, 2023 at 08:30:21PM -0000, Daniel Colascione wrote:
> As mentioned in [1], instead of finding a way to have the kernel
> unwind user programs, we can create a protocol through which the
> kernel can ask usermode to unwind itself.

I like this idea and have been thinking along similar lines. It would
be great if we could prototype it with existing mechanisms, to show
that it is feasible and fast enough, and to cover existing linux/glibc
installs that don't yet have the new interfaces.

> It could work like this:
>
> 1) backtrace requested in the kernel (e.g. due to a perf counter
> overflow)
>
> 2) kernel unwinds itself to the userspace boundary the usual way
>
> 3) kernel forms a nonce (e.g. by incrementing a 64-bit counter)
>
> 4) kernel logs a stack trace the usual way (e.g. to the ftrace ring
> buffer), but with the final frame referring to the nonce created in
> the previous step
>
> 5) kernel queues a signal (one userspace has explicitly opted into
> via a new prctl()); the siginfo_t structure encodes (e.g. via
> si_status and si_value) the nonce

So before it does this prctl() the process needs to set up all the
data structures it needs to safely handle the unwinding during signal
handling. That does mean that early process setup won't be able to be
profiled with user backtraces.

> 6) kernel eventually returns to userspace; queued signal handler
> gains control

So at this point, if the event triggered while that userspace thread
was running, the event logged by the kernel is basically just that
nonce?

> 7) signal handler unwinds the calling thread however it wants (and
> can sleep and take page faults if needed)

So in theory this can do anything that is async-signal-safe. But what
if it takes too long and another event gets triggered? Or it does a
syscall that produces an event? Another signal arrives? Or it causes
a SEGV?

> 8) signal handler logs the result of its unwind, along with the
> nonce, to the system log (e.g.
> via a new system call, a sysfs write, an io_uring submission, etc.)

A (shared) memory region seems simplest: whatever has been put into it
when the signal handler returns is the userspace backtrace
contribution. Maybe the prctl() call can set this up? Or maybe the
kernel can provide it through one of the siginfo_t fields?

> Post-processing tools can associate kernel stacks with user stacks
> tagged with the corresponding nonces and reconstitute the full
> stacks in effect at the time of each logged event.
>
> We can avoid duplicating unwinds too: at step #3, if the kernel
> finds that the current thread already has an unwind pending, it can
> use the already-pending nonce instead of making a new one and
> queuing a signal: many kernel stacks can end with the same user
> stack "tail".

This is probably a generic optimization for most backtraces; most will
have a similar tail.

> One nice property of this scheme is that the userspace unwinding
> isn't limited to native code. Libc could arbitrate unwinding across
> an arbitrary number of managed runtime environments in the context
> of a single process: the system could be smart enough to know that
> instead of unwinding through, e.g., Python interpreter frames, the
> unwinder (which is normal userspace code, pluggable via DSO!) could
> traverse and log *Python* stack frames instead, with meaningful
> function names. And if you happened to have, say, a JavaScript
> runtime in the same process, both JavaScript and Python could
> participate in the semantic unwinding process.
>
> A pluggable userspace unwind mechanism would have zero cost in the
> case that we're not recording stack frames. On top of that, a
> pluggable userspace unwinder *could* be written to traverse frame
> pointers just as the kernel unwinder does today, if userspace thinks
> that's the best option. Without breaking kernel ABI, that userspace
> unwinder could use DWARF, or ORC, or any other userspace unwinding
> approach. It's future-proof.
This is nice, but it does need some coordination for handing off the
unwinding context between the different unwinders. You could register
an unwinder by memory region (if the address is in this range, it is
in the lua interpreter), and then start generating a backtrace using
the dedicated unwinder for that memory region. That original unwinder
can start with the full ucontext_t. But what is the contract between
unwinders? If we start with e.g. a frame-pointer unwinder, then by the
time we get to a part that needs the python unwinder we only have a
pc, sp and fp left. Is that enough context for the python unwinder to
continue?

What would we need to prototype this idea and show that we can produce
quick backtraces using fast eh_frame unwinding, before we convince the
kernel to provide this interface and have it in glibc (or the vdso;
without any fancy caching, it might fit in the vdso)? We could use
LD_PRELOAD, or some ptrace parasite code like criu uses, to insert the
unwinder code and trigger registration; use an itimer and SIGPROF as
the signal; and use shared memory as a ring buffer to provide the
backtrace addresses (and possibly other context - build-id mappings?)
for a profiling app/library to read events from. Without explicit
kernel support that might not feel like system-wide profiling, but it
should give us a feel for how well it would work.

Or are there other holes/missing functionality?
Cheers,

Mark

> [1] https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/message/646XXHGEGOKO465LQKWCPPPAZBSW5NWO/