----- On Oct 26, 2020, at 8:05 AM, peter enderborg peter.enderborg@xxxxxxxx wrote:

> On 10/23/20 9:53 PM, Michael Jeanson wrote:
>> When invoked from system call enter/exit instrumentation, accessing
>> user-space data is a common use-case for tracers. However, tracepoints
>> currently disable preemption around iteration on the registered
>> tracepoint probes and invocation of the probe callbacks, which prevents
>> tracers from handling page faults.
>>
>> Extend the tracepoint and trace event APIs to allow specific tracer
>> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
>> called from sleepable context, and convert the system call enter/exit
>> instrumentation to sleepable tracepoints.
>
> Will this not be a problem for analyse of the trace? It get two
> relevant times, one it when it is called and one when it returns.

It will depend on what the tracer chooses to do. If we call the
side-effect of what is being traced a "transaction" (e.g. actually
opening a file descriptor and adding it to a process's file descriptor
table as the result of an open(2) system call), we have to consider that
already today the timestamp which we get is either slightly before or
after the actual side-effect of the transaction in the kernel. That is
true even without being preemptible.

Sometimes it's not relevant to have a tracepoint before and after the
transaction, e.g. when all we care about is to know whether the
transaction has successfully happened or not.

In the case of system calls, we have sys_enter and sys_exit to mark the
beginning and end of the "transaction". Whatever side-effects are done
by the system call happen in between.

I think the question here is whether it matters that page faults
triggered by accessing system call input parameters happen before we
trace a "system call entry" event, rather than after it.
If the tracers care, then it would be up to them to first trace that
"system call entry" event, and have a separate event for the argument
payload.

But there are other ways to identify whether page faults happen within
the system call or from user-space, for instance by using the
instruction pointer associated with the page fault. So when observing
page faults happening before sys_enter, but associated with a kernel
instruction pointer, a trace analysis tool could theoretically figure
out who is to blame for that page fault, *if* it cares.

>
> It makes things harder to correlate in what order things happen.

The alternative is to have partial payloads like LTTng does today for
system call arguments. If reading a string from userspace (e.g. open(2)
file name) requires taking a page fault, LTTng truncates the string.
This is pretty bad for automated analysis as well.

>
> And handling of tracing of contexts that already are not preamptable?

The sleepable tracepoints are only meant to be used in contexts which
can sleep. For tracepoints placed in non-preemptible contexts, those
should never take a page fault to begin with.

>
> Eg the same tracepoint are used in different places and contexts.

As far as considering that a given tracepoint "name" could be
registered to by both sleepable and non-sleepable tracer probes, I
would like to see an actual use-case for this. I don't have any.

I can envision that some tracer code will want to be allowed to work in
both sleepable and non-sleepable contexts, e.g. take page faults in
sleepable context (when called from a sleepable tracepoint), but fall
back to a truncation behavior when called from non-sleepable context.
This can actually be done by looking at the new "TRACEPOINT_MAYSLEEP"
tp flag. Passing that tp_flags to code shared between sleepable and
non-sleepable probes would allow the callee to know whether it can take
a page fault or not.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
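
For illustration, the idea of sharing code between sleepable and
non-sleepable probes by passing tp_flags could be sketched roughly as
below. This is a user-space simulation, not kernel code from the patch
series: only the flag name TRACEPOINT_MAYSLEEP comes from the thread.
Its numeric value, struct user_str, and record_user_string() are
hypothetical names made up for this sketch, and "resident_len" merely
stands in for the part of a user string that could be read without
taking a page fault.

```c
#include <stddef.h>
#include <string.h>

/* Flag name from the thread; the numeric value here is an assumption. */
#define TRACEPOINT_MAYSLEEP (1U << 0)

/* Hypothetical model of a user-space string that may be partly paged out. */
struct user_str {
	const char *data;    /* the complete NUL-terminated string */
	size_t resident_len; /* bytes readable without a page fault */
};

/*
 * Hypothetical helper shared by sleepable and non-sleepable probes.
 *
 * With TRACEPOINT_MAYSLEEP set, the whole string may be read (modelling
 * a probe that is allowed to fault pages in). Without it, only the
 * resident prefix is read, which models the truncation fallback.
 *
 * Returns the number of bytes copied, excluding the terminating NUL.
 * *truncated is set to 1 when fewer bytes than the full string were
 * copied (no faulting allowed, or buf too small), else to 0.
 */
size_t record_user_string(unsigned int tp_flags, const struct user_str *src,
			  char *buf, size_t buf_len, int *truncated)
{
	size_t full = strlen(src->data);
	size_t avail = full;
	size_t n;

	if (!(tp_flags & TRACEPOINT_MAYSLEEP) && src->resident_len < full)
		avail = src->resident_len;

	if (buf_len == 0) {
		*truncated = full > 0;
		return 0;
	}

	/* Leave room for the terminating NUL. */
	n = avail < buf_len - 1 ? avail : buf_len - 1;
	memcpy(buf, src->data, n);
	buf[n] = '\0';
	*truncated = n < full;
	return n;
}
```

A sleepable probe would pass the tp_flags it was registered with (with
TRACEPOINT_MAYSLEEP set) and get the full payload, while a probe called
from non-sleepable context would pass flags without that bit and get a
truncated payload plus an explicit truncation marker that an analysis
tool can check.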