Re: [RFC PATCH 0/6] Sleepable tracepoints

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Mon, 26 Oct 2020 10:59:39 -0400 (EDT)

----- On Oct 26, 2020, at 8:05 AM, peter enderborg peter.enderborg@xxxxxxxx wrote:

> On 10/23/20 9:53 PM, Michael Jeanson wrote:
>> When invoked from system call enter/exit instrumentation, accessing
>> user-space data is a common use-case for tracers. However, tracepoints
>> currently disable preemption around iteration on the registered
>> tracepoint probes and invocation of the probe callbacks, which prevents
>> tracers from handling page faults.
>>
>> Extend the tracepoint and trace event APIs to allow specific tracer
>> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
>> called from sleepable context, and convert the system call enter/exit
>> instrumentation to sleepable tracepoints.
> 
> Will this not be a problem for analyse of the trace? It get two
> relevant times, one it when it is called and one when it returns.

It will depend on what the tracer chooses to do. If we call the side-effect
of what is being traced a "transaction" (e.g. actually opening a file
descriptor and adding it to a process'file descriptor table as the result
of an open(2) system call), we have to consider that already today the
timestamp which we get is either slightly before or after the actual
side-effect of the transaction in the kernel. That is true even without
being preemptable.

Sometimes it's not relevant to have a tracepoint before and after the
transaction, e.g. when all we care about is to know that the transaction
has successfully happened or not.

In the case of system calls, we have sys_enter and sys_exit to mark the
beginning and end of the "transaction". Whatever side-effects are done by
the system call happens in between.

I think the question here is whether it is relevant to know whether page
faults triggered by accessing system call input parameters need to
happen after we trace a "system call entry" event. If the tracers care,
then it would be up to them to first trace that "system call entry" event,
and have a separate event for the argument payload. But there are other
ways to identify whether page faults happen within the system call or
from user-space, for instance by using the instruction pointer associated
with the page fault. So when observing page faults happening before sys
enter, but associated with a kernel instruction pointer, a trace analysis
tool could theoretically figure out who is to blame for that page fault,
*if* it cares.

> 
> It makes things harder to correlate in what order things happen.

The alternative is to have partial payloads like LTTng does today for
system call arguments. If reading a string from userspace (e.g. open(2)
file name) requires to take a page fault, LTTng truncates the string. This
is pretty bad for automated analysis as well.

> 
> And handling of tracing of contexts that already are not preamptable?

The sleepable tracepoints are only meant to be used in contexts which can sleep.
For tracepoints placed in non-preemptible contexts, those should never take
a page fault to begin with.

> 
> Eg the same tracepoint are used in different places and contexts.

As far as considering that a given tracepoint "name" could be registered to
by both a sleepable and non-sleepable tracer probes, I would like to see an
actual use-case for this. I don't have any.

I can envision that some tracer code will want to be allowed to work in
both sleepable and non-sleepable context, e.g. take page faults in
sleepable context (and called from a sleepable tracepoint), but have a
truncation behavior when called from non-sleepable context. This can actually
be done by looking at the new "TRACEPOINT_MAYSLEEP" tp flag. Passing that
tp_flags to code shared between sleepable and non-sleepable probes would allow
the callee to know whether it can take a page fault or not.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com