On 7/9/22 12:05, Christian Hergert wrote:
>> Why does the unwinding need to happen in the kernel?  The kernel can
>> already asynchronously invoke userspace code in the form of signal
>> handlers.  Is the problem that it is necessary to collect profiling
>> information in the middle of a system call, where another syscall
>> would see inconsistent (and potentially exploitable) kernel state?
>
> One of the primary values of system-wide profiling is being able to
> see how a particular library call might have caused undesirable code
> paths due to a syscall, and where/what was reached (given high enough
> sampling rate).
>
> Does that need to happen in kernel space? I don't know, per se, other
> than perf needs to be able to do that work as it is what gives us
> the array of instruction pointers back. There was some chatter a
> number of years ago in perf about how to handle ORC from user-space,
> and if I'm summing this up correctly, it was basically:
>
> - When sampling in PERF_CONTEXT_KERNEL, stop unwinding at the syscall
>   boundary
> - Append stacktrace samples to perf buffer ring
> - Upon rescheduling, backtrace a single time into user-space, and expect
>   the consumer to know that N previous samples with matching task-id all
>   have the user-space backtrace.
>
> That's a pretty significant behavior change, and all tools would need
> surgery to support it. I have no idea if that is palatable to either
> side of the debate, but it was the one possible direction I saw.
>
> It does have a number of pros, in that you can save a lot of unwinding
> time on syscall-heavy workloads by doing user-space unwinding once,
> and from scheduler task queues (so you can take faults), and can
> avoid the NMI context being the cost-center for accounting. But
> the cons are significant in that the behavior change is expansive,
> affects all tooling, and will require ORC data across the platform.

This (or a variant of it) is the only reasonable solution I know of.
The current situation is not acceptable, and a system-wide slowdown from
-fno-omit-frame-pointer is also not acceptable.  A solution like the one
you suggested here will be much more work, but it will also be a much
better product.

>> Ouch.  That is a serious problem for a number of reasons, not least
>> of which is security.  Having the kernel parse even more complex
>> untrusted input in C is a horrible idea.
>
> It might seem that way by the description I gave, but we're just
> talking an array of intptr_t or similar. There is no dereferencing or
> state machines like you have with DWARF. Runtime resolution is also
> essentially bsearch() on interval arrays. I really don't think it's
> the sort of thing that requires Rust.

bsearch() itself assumes that the input is trusted, but it should be
possible to have a variant that does not make that assumption.  Similarly,
it should be possible to ensure that all user pointer accesses are guarded
by checks to ensure they will not fault and actually point to userspace
memory.

> As for eBPF, we'd still probably be in NMI context with this route,
> and would fail if we had to page in ORC tables. So that means we'd
> either have to take a per-task memory overhead to maintain the mutated
> form (probably unreasonable) or find a way for that to be done from
> the task's space when returning from the syscall.

The latter is definitely the better option.  The NMI handler needs to be
simpler, not more complex.  One option would be to replace the normal
syscall return with a return to a userspace trampoline.  The trampoline
would write the userspace backtrace to a kernel-provided buffer and then
jump to the original return address.

Some programs (such as LVM) would need to be able to opt out of such
profiling.  LVM has critical sections where it is not safe to perform any
I/O, as the device that backs the root filesystem might be suspended.
Such a program would only be able to participate in unwinding if
mlockall() was used.
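
For illustration, the bsearch() variant mentioned above might look roughly
like the sketch below.  This is only a sketch under stated assumptions:
struct unwind_entry, its fields, and lookup_untrusted() are hypothetical
names, not the kernel's actual ORC layout or API, and the table is assumed
to have already been copied into memory that is safe to read (a real
implementation would first need a non-faulting copy from userspace).  The
idea is that the loop stays in bounds and finishes in O(log n) steps even
if the table is unsorted or malformed, and that a hit is reported only
after checking that the entry really covers the address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry covering [start, start + len).
 * Illustrative only; this is NOT the real ORC entry format. */
struct unwind_entry {
    uint64_t start;      /* first address covered by this entry */
    uint64_t len;        /* number of bytes covered */
    int32_t  sp_offset;  /* placeholder for per-range unwind data */
};

/*
 * Binary search over a table whose contents are not trusted.
 *
 * - Every index used is inside [0, n), whatever the table contains.
 * - The window [lo, hi) shrinks on every iteration, so the loop ends
 *   after O(log n) steps even if the table is not sorted.
 * - A match is returned only if the entry actually contains ip, so a
 *   bogus table can produce a miss but never a wrong "hit".
 *
 * Returns the index of the matching entry, or -1 if none was found.
 */
long lookup_untrusted(const struct unwind_entry *tab, size_t n, uint64_t ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint64_t start = tab[mid].start;
        uint64_t len = tab[mid].len;

        if (ip < start) {
            hi = mid;
        } else if (ip - start < len) {
            /* ip >= start here, so the subtraction cannot wrap and
             * start + len is never computed (it could overflow). */
            return (long)mid;
        } else {
            lo = mid + 1;
        }
    }
    return -1;
}
```

With a lookup shaped like this, a malformed table can only make the search
miss, which costs the offending process its own backtrace; it cannot cause
an out-of-bounds read or an unbounded loop.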

>> Christian, would this be sufficient for your needs?
>
> I don't think so without significant work. The best case I see here is
> for perf to support user-space unwinding within the task, be it ORC
> or DWARF, and that unwinder not have to necessarily be in-tree with
> the kernel because we know they won't accept a DWARF unwinder again.

Would it be sufficient with that significant work?

> It does pose some questions on what would happen with carefully
> crafted DWARF data.

Processing untrusted DWARF data is probably a bad idea.

--
Sincerely,
Demi Marie Obenour (she/her/hers)