Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

Kris Van Hees <kris.van.hees@xxxxxxxxxx> · Wed, 22 May 2019 01:23:27 -0400

On Tue, May 21, 2019 at 05:48:11PM -0400, Steven Rostedt wrote:
> On Tue, 21 May 2019 14:43:26 -0700
> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> 
> > Steve,
> > sounds like you've missed all prior threads.
> 
> I probably have missed them ;-)
> 
> > The feedback was given to Kris it was very clear:
> > implement dtrace the same way as bpftrace is working with bpf.
> > No changes are necessary to dtrace scripts
> > and no kernel changes are necessary.
> 
> Kris, I haven't been keeping up on all the discussions. But what
> exactly is the issue where Dtrace can't be done the same way as the
> bpftrace is done?

There are several issues (and I keep finding new ones as I move forward) but
the biggest one is that I am not trying to re-design and re-implement) DTrace
from the ground up.  We have an existing userspace component that is getting
modified to work with a new kernel implementation (based on BPF and various
other kernel features that are thankfully available these days).  But we need
to ensure that the userspace component continues to function exactly as one
would expect.  There should be no need to modify DTrace scripts.  Perhaps
bpftrace could be taught to parse DTrace scripts (i.e. implement the D script
language with all its bells and whistles) but it currently cannot and DTrace
obviously can.  It seems to be a better use of resources to focus on the
kernel component, where we can really provide a much cleaner implementation
for DTrace probe execution because BPF is available and very powerful.

Userspace aside, there are various features that are not currently available
such as retrieving the ppid of the current task, and various other data items
that relate to the current task that triggered a probe.  There are ways to
work around it (using the bpf_probe_read() helper, which actually performs a
probe_kernel_read()) but that is rather clunky and definitely shouldn't be
something that can be done from a BPF program if we're doing unprivileged
tracing (which is a goal that is important for us).  New helpers can be added
for things like this, but the list grows large very quickly once you look at
what information DTrace scripts tend to use.

One of the benefits of DTrace is that probes are largely abstracted entities
when you get to the script level.  While different probes provide different
data, they are all represented as probe arguments and they are accessed in a
very consistent manner that is independent from the actual kind of probe that
triggered the execution.  Often, a single DTrace clause is associated with
multiple probes, of different types.  Probes in the kernel (kprobe, perf event,
tracepoint, ...) are associated with their own BPF program type, so it is not
possible to load the DTrace clause (translated into BPF code) once and
associate it with probes of different types.  Instead, I'd have to load it
as a BPF_PROG_TYPE_KPROBE program to associate it with a kprobe, and I'd have
to load it as a BPF_PROG_TYPE_TRACEPOINT program to associate it with a
tracepoint, and so on.  This also means that I suddenly have to add code to
the userspace component to know about the different program types with more
detail, like what helpers are available to specific program types.

Another advantage of being able to operate on a more abstract probe concept
that is not tied to a specific probe type is that the userspace component does
not need to know about the implementation details of the specific probes.
This avoids a tight coupling between the userspace component and the kernel
implementation.

Another feature that is currently not supported is speculative tracing.  This
is a feature that is not as commonly used (although I personally have found it
to be very useful in the past couple of years) but it quite powerful because
it allows for probe data to be recorded, and have the decision on whether it
is to be made available to userspace postponed to a later event.  At that time,
the data can be discarded or committed.

These are just some examples of issues I have been working on.  I spent quite
a bit of time to look for ways to implement what we need for DTrace with a
minimal amount of patches to the kernel because there really isn't any point
in doing unnecessary work.  I do not doubt that there are possible clever
ways to somehow get around some of these issues with clever hacks and
workarounds, but I am not trying to hack something together that hopefully
will be close enough to the expected functionality.

DTrace has proven itself to be quite useful and dependable as a tracing
solution, and I am working on continuing to deliver on that while recognizing
the significant work that others have put into advancing the tracing
infrastructure in Linux in recent years.  So many people have contributed
excellent features - and I am making use of those features as much as I can.
But as is often the case, not everything that I need is currently implemented.
As I expressed during last year's Plumbers in Vancouver, I am putting a very
strong emphasis on ensuring that what I propose as contributions is not
limited to just DTrace.  My goal is to work in an open, collaborative manner,
providing features that anyone can use if they want to.

I wish that the assertion that "no changes are necessary to dtrace scripts and
no kernel changes are necessary" were true, but my own findings contradict
that.  To my knowledge no tool exists right now that can execute any and all
valid DTrace scripts without any changes to the scripts and without any changes
to the kernel.  The only tool I know that can execute DTrace scripts right now
does require rather extensive kernel changes, and the work I am doing right now
is aimed at doing much better than that.

	Kris