Re: [PATCH 0/2] bpf: context casting for tail call and gtrace prog type

Kris Van Hees <kris.van.hees@xxxxxxxxxx> · Mon, 11 Mar 2019 10:21:27 -0400

On Thu, Mar 07, 2019 at 01:30:37PM -0800, Alexei Starovoitov wrote:
> On Tue, Mar 05, 2019 at 09:03:57PM -0500, Kris Van Hees wrote:
> > 
> > So in summary, I am trying to solve two (related) problems:
> > 
> > - Ensure that unprivileged tracing can obtain information about the task that
> >   triggered a probe or event.  There will always be limitations but we can do
> >   better than is available now.
> 
> I think unprivileged tracing is a huge topic on its own.
> It's too easy to create security holes with such mechanism.
> kprobe/tracepoints/etc have been historically root only and I don't see a way
> for them to become unpriv.

I agree it is a huge topic and one that needs very careful attention, but I
believe it certainly can be done (DTrace on Solaris provided it, and while we
haven't implemented it on Linux thus far, the availability of BPF actually
makes it more realistic).  This remains an important goal for us, and while
it will take a while to get there, we certain want to contribute in this area.

> imo the existing /proc/pid/status is already more powerful than
> what you're proposing with gtrace context.

THe problem with /proc/pid/status is that it cannot be read at the exact same
time as when a peobe fires.  And when you're doing tracing, that is one of the
main requirements: looking at the system at the moment of the probe firing.
Surely, /proc/pid/status contains more information right now but as I mentioned
in earlier email, the gtrace context I proposed so far was merely showing what
I am trying to do - it is nowhere near the final version.

> > - Allow tracing tools ab ability to provide actions to be performed when a
> >   probe or event fires, beyond what the individual BPF program types allow
> >   for the specific probe/event types (and do it in a generic manner, in a
> >   sense encapsulating multiple probe/event types in a more generic tracing
> >   context).
> 
> I think existing bpf tracing is generic whereas proposed gtrace is not generic at all.
> 'generic' is a loaded word. we can throw it back and forth and won't make
> any forward progress. Let's focus on technical bits, ok?

Sure, I definitely agree that we can mean very different things with 'generic'.
It is important then to explain what I mean here though because it is rather
crucial to the design.  From the perspective of tracing that we want to be able
to do we are mostly interested in being able to look at the system as a whole
through the sequence of probes that fire.  In that way, the specific context of
each probe is important, but the actions to be executed when probes fire need
more than just that probe context - they also need information about the task
that is executing.  An example would be to trace whether non-root executables
(often scripts) make use of setuid executables that could potentially be a
security risk.  You end up tracing through a possible complex tree of task
clones, tasks issueing exec syscalls, and performing file operations.  The
probes themselves are pretty meaningless without the larger context.  It is the
task context combined with the probe specifics that provide the information you
need to execute meaningful actions when a probe fires.  And that information
needs to be obtained when the probe fires - reading it from /proc when the
probe data reaches userspace is often too late.  Think of the case where you
need to trace information when an exec syscall is executed - by the time
tracing data is available in a buffer for userspace to process, the syscall
will usually have completed and userspace will only be able to obtain task
info about the state *after* the exec took place.

The majority of tracing use-cases that I encounter relate to observing one or
more tasks rather than probes without caring about the task context, and I
would argue that most use-cases I have read about from other people match this
observation,  Since we want to be able to use BPF as the execution engine for
the probe actions, it seems to make sense to me that the BPF context available
to those programs would therefore be providing the task data and probe data,
so that the programs can be more naturally written based on the context that
makes sense for the functionality they are providing.

So, my patches provide a initial implementation of a BPF program context (and
a mechanism to execute programs in it as a result of probes triggering) that
serves the perspective of the tracer looking at tasks.

> > A patch I am currently working on ties into this (and I hope to get it ready
> > sometime next week).  It builds on the support you already have for accessing
> > packet data from the __sk_buff context.  If we can make this same functionality
> > available to other contexts as well, my goal would be to make it possible for
> > the generic tracing context to have a buffer (data and data_end members) that
> > the BPF program can issue direct stores to as a means to allow a tracing
> > program to control how data is written into the buffer. 
> 
> sounds like you're trying to reinvent bpf_perf_event_output() mechanism.

It sure sounds like that, but the difference is that bpf_perf_event_output()
encapsulates data you are providing in a specific format already.  Most of the
tracers that support BPF in some way seems to pick a perf_event output type
that allows providing raw data and use that to write out what they need to pass
to userspace.

Our situation is a bit different because we have an existing tracing tool
(DTrace) with its own requirements on what the output buffer data is supposed
to look like.  When I looked at the options of using (somewhat abusing) the
bpf_perf_event_output() mechanism as a vehicle to get data to userspace vs
supporting DTrace's format, I concluded that being able to let the tracer
(DTrace in my case) define the output format is a benefit because it means
others can do the same (if they want to).  And given that DTrace works with
multiple buffer types and with things like speculation buffers (buffers that
are used as temporary output store in place of the default buffer - in a way
that the rest of the action need not be aware of - and written to the default
buffer upon commit, or discarded when not needed), bpf_perf_event_output() is
not sufficient.

> > But being able to do things like this without
> > needing to touch the context of any other BPF program type is a great benefit
> > to offer tracing tools, as far as I see it.
> 
> I still don't understand what you're referring to by 'things like this'
> that somehow will be possible in the future, but not possible today.
> Could you please give concrete example?

My apologies for not being clear.  I am referring to the features of the
gtrace context in terms of containing task information, and output buffers
to be used in BPF programs triggered from various probe sources (kprobe,
tracepoints, ...)  I would not want to suggest making changes to all the
different program contexts in order to support tracing needs because that
would be wrong.  Doing it in a central place makes it a lot easier to maintain
without impacting other program types, etc.

Of course, yes, bpf_probe_read() and bpf_perf_event_output() can be used
to implement a lot of what existing tracing tools like DTrace can do, if you
write them based on that.  One limitations I am obviously working with is
that DTrace already exists and has existed for a long time.  And while it is
100% available as open source, it involves a pretty involved set of patches to
be applied to the kernel to be able to use it which is just not ideal.  Hence
the goal to make it available by re-using as much of the existing features in
Linux as possible, while still maintaining the same level of functionality in
DTrace.  That means we need to fill the gaps - and from where I am sitting,
the ways to do that might as well be of use to others (if they want to).

If phrasing things in the context of DTrace would make the conversation easier
I certainly don;t mind doing that, but I really don't want to limit my patches
to supporting just DTrace (even if right now it might be the only tracer using
it).

	Kris