Re: [PATCH 0/2] bpf: context casting for tail call and gtrace prog type

Kris Van Hees <kris.van.hees@xxxxxxxxxx> · Tue, 5 Mar 2019 21:03:57 -0500

On Tue, Mar 05, 2019 at 10:59:52AM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 26, 2019 at 01:46:01AM -0500, Kris Van Hees wrote:
> > On Mon, Feb 25, 2019 at 10:18:25PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Feb 25, 2019 at 07:54:13AM -0800, Kris Van Hees wrote:
> > > > 
> > > > The goal is to further extend the BPF_PROG_TYPE_GTRACE implementation to
> > > > support what tracers commonly need, and I am also looking at ways to further
> > > > extend this model to allow more tracer-specific features as well without the
> > > > need for adding a BPF program type for every tracer.
> > > 
> > > It seems by themselves the patches don't provide any new functionality,
> > > but instead look like plumbing to call external code.
> > 
> > The patches are definitely not plumbing to call external code, and if I gave
> > that impression I apologise.  I overlooked the information you quote below on
> > allowing new functionality through modules when I wrote the comment above but
> > please note that it was a forward-looking comment in terms of what could be
> > done - not a reason for the patches that I submitted.
> > 
> > The patches accomplish something that is totally independent from that: they
> > make it possible for existing events that execute BPF programs when triggered
> > to transfer control to a BPF program with a more rich context.  The first
> > patch makes such a transfer possible (using tail-call combined with converting
> > the context to the new program type), and the second patch provides one such
> > program type (generic trace).  The new functionality provided by the program
> > type is direct access to task information that previously could only be
> > obtained through helper calls.  E.g. the new program type allows programs to
> > access the task state, prio, ppid, euid, and egid.  None of those pieces of
> > information can currently be obtained unless you start poking around in
> > memory using bpf_probe_read() helper calls.
> 
> I don't think I understand the problem you're trying to solve.
> From kprobe/tracepoints/etc bpf prog can use bpf_probe_read() to read everything.
> Are you saying direct access to state, prio, ppid, euid, and egid via context
> is much superior? Why? Because it's more stable?

When you provide tracing to non-privileged users you definitely do not want
to allow BPF programs to access any memory they want in kernel space, yet you
would still want to be able to provide a decent amount of information about
tasks at time of probe firing.

> Why stop at these fields then? task_struct has many others.
>
> What we observed is that no matter how many fields we add to stable uapi
> somebody will always request one more. For networking the total number of
> such fields is contained, but for tracing we're talking about thousands
> of useful fields. We cannot make them stable.
> Hence we've been working on alternative approach via BTF to make all
> of kernel internal fields sort-of stable via 'compile once' technique that
> we described at the last LPC.

Sure, but the ones I put in there were an example of how this can be used.
And again, in the case of unprivileged tracing, this easily becomes an issue
about where you end up enforcing what a tracing program can do and cannot do.
There will always be cases where more than the 'standard' information is
needed for a tracing task, and then it would be quite reasonable to conclude
that a higher level of privileges is required to accomplish that - but that
shouldn't prevent unprivileged tracing from being able to be useful as well.

Again, the limited set of fields I put in there right now is a matter of
showing how this can be used.  It is certainly meant to be expanded quite a
bit.
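
As a rough illustration of what direct access through the context looks like
from the program side, here is a plain C userspace sketch.  It is not the code
from the patches: the struct name, the field types, and trace_prog() are
hypothetical.  Only the field names (state, prio, ppid, euid, egid) come from
the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical read-only task view handed to a tracing program.
 * Field names follow the ones listed above; the types are assumptions.
 */
struct gtrace_task_ctx {
	int64_t  state;
	int32_t  prio;
	uint32_t ppid;
	uint32_t euid;
	uint32_t egid;
};

/*
 * Hypothetical tracing program: report (return 1) only for tasks running
 * with effective uid 0.  Everything it needs comes from the context, so
 * no arbitrary kernel memory read is involved.
 */
int trace_prog(const struct gtrace_task_ctx *ctx)
{
	assert(ctx != NULL);
	return ctx->euid == 0;
}
```

The point of the sketch is only that the program reads fields the context
chooses to expose, which is where an unprivileged policy could be enforced.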

The primary reason though behind the context conversion approach and the
generic tracing program type and context is that tracing on Linux based on
the existing kernel facilities limits the userspace tools because userspace
has quite limited control over what happens when a probe/event fires.  One
of the features of advanced tracing tools has been the ability to have more
(safe) control over what happens when the probe/event fires and how data is
stored in output buffers.  Since the userspace tool is the one requesting data
and ultimately processes the generated data, it stands to reason that it
would benefit from being able to have more freedom in that area.  But that
means it needs to be able to provide a BPF program of a type that more closely
relates to the tracing tool functionality rather than the probe or event
itself (especially since probes and events are very specific, and by their
very nature should not really care about how userspace uses information).
This is again even more true for unprivileged tracing - right now there is a lot
of useful task information that you cannot get to without bpf_probe_read() but
unprivileged users really shouldn't be able to just read arbitrary kernel
memory.

So in summary, I am trying to solve two (related) problems:

- Ensure that unprivileged tracing can obtain information about the task that
  triggered a probe or event.  There will always be limitations but we can do
  better than is available now.
- Allow tracing tools the ability to provide actions to be performed when a
  probe or event fires, beyond what the individual BPF program types allow
  for the specific probe/event types (and do it in a generic manner, in a
  sense encapsulating multiple probe/event types in a more generic tracing
  context).

A patch I am currently working on ties into this (and I hope to get it ready
sometime next week).  It builds on the support you already have for accessing
packet data from the __sk_buff context.  If we can make this same functionality
available to other contexts as well, my goal would be to make it possible for
the generic tracing context to have a buffer (data and data_end members) that
the BPF program can issue direct stores to as a means to allow a tracing
program to control how data is written into the buffer.  I am still working
out some details but I have a prototype working, and it retains all safety
provisions that BPF offers us.  But being able to do things like this without
needing to touch the context of any other BPF program type is a great benefit
to offer tracing tools, as far as I see it.
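
To make the buffer idea a bit more concrete, here is a plain C userspace
sketch of the bounds-checked store pattern.  It is not the prototype itself:
struct trace_ctx and store_u32() are made-up names, and only the data and
data_end members come from the description above.  In an actual BPF program
the comparison against data_end is the kind of check the verifier already
requires before direct packet access through __sk_buff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context; only data and data_end are named in the text. */
struct trace_ctx {
	uint8_t *data;		/* start of the output buffer */
	uint8_t *data_end;	/* one past the last writable byte */
};

/*
 * Store a 32-bit value at byte offset 'off' in the output buffer.
 * Returns 0 on success, -1 if the store would not fit entirely
 * between data and data_end.
 */
int store_u32(struct trace_ctx *ctx, size_t off, uint32_t val)
{
	size_t len;

	assert(ctx->data <= ctx->data_end);
	len = (size_t)(ctx->data_end - ctx->data);

	/* Reject the store unless all 4 bytes fit inside the buffer. */
	if (off > len || len - off < sizeof(val))
		return -1;

	memcpy(ctx->data + off, &val, sizeof(val));
	return 0;
}
```

With an 8-byte buffer, stores at offsets 0 and 4 succeed while a store at
offset 5 is rejected, so the program decides the record layout but cannot
write outside the buffer it was given.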

	Kris