On Wed, May 13, 2020 at 2:59 PM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote:
>
> On Wed, 13 May 2020, Andrii Nakryiko wrote:
>
> > This commit adds a new MPSC ring buffer implementation into the BPF
> > ecosystem, which allows multiple CPUs to submit data to a single shared
> > ring buffer. On the consumption side, only a single consumer is assumed.
> >
> > Motivation
> > ----------
> > There are two distinct motivations for this work, neither of which is
> > satisfied by the existing perf buffer, and which prompted the creation
> > of a new ring buffer implementation:
> > - more efficient memory utilization by sharing the ring buffer across CPUs;
> > - preserving the ordering of events that happen sequentially in time,
> >   even across multiple CPUs (e.g., fork/exec/exit events for a task).
> >
> > These two problems are independent, but the perf buffer fails to satisfy
> > both. Both are a result of the choice to have a per-CPU perf ring buffer.
> > Both can also be solved by an MPSC implementation of the ring buffer. The
> > ordering problem could technically be solved for the perf buffer with
> > some in-kernel counting, but given that the first problem requires an
> > MPSC buffer anyway, the same solution solves the second problem
> > automatically.
> >
>
> This looks great Andrii! One potentially interesting side-effect of
> the way this is implemented is that it could (I think) support
> speculative tracing.
>
> Say I want to record some tracing info when I enter function foo(), but
> I only care about cases where that function later returns an error value.
> I _think_ your implementation could support that via a scheme like
> this:
>
> - attach a kprobe program to record the data via bpf_ringbuf_reserve(),
>   and store the reserved pointer value in a per-task keyed hashmap.
>   Then record the values of interest in the reserved space. This is our
>   speculative data, as we don't know whether we want to commit it yet.
>
> - attach a kretprobe program that picks up our reserved pointer and
>   commit()s or discard()s the associated data based on the return value.
>
> - the consumer should (I think) then only read the committed data, so in
>   this case just the data of interest associated with the failure case.
>
> I'm curious if that sort of ringbuf access pattern across multiple
> programs would work? Thanks!

Right now it's not allowed. Similarly to spin locks and socket references,
the verifier will enforce that a reserved record is committed or discarded
within the same BPF program invocation. Technically, nothing prevents us
from relaxing this and allowing such a pointer to be stored in a map, but
that's probably way too dangerous and not necessary for most common cases.

But all of your trouble here comes from using a kprobe+kretprobe pair. What
I think should solve your problem is a single fexit program. It can read the
input arguments *and* the return value of the traced function, so there
won't be any need for an additional map or for storing speculative data
(and no speculation at all, because you'll know beforehand whether you even
need to capture the data); see the sketch at the end of this mail. Does this
work for your case?

>
> Alan
>

[...] no one seems to like trimming emails ;)
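
P.S. Here is a rough illustration of the fexit approach. It's only a
sketch: kern_func, the event struct layout, and the ring buffer sizing are
made-up placeholders, not anything from this patch set. The point is just
that reserve and submit both happen within one program invocation:

/* sketch: capture data only when the traced function fails */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct event {
	int pid;
	int err;
	/* ... whatever else is interesting to capture ... */
};

/* kern_func(int arg) stands in for the traced kernel function */
SEC("fexit/kern_func")
int BPF_PROG(trace_kern_func, int arg, int ret)
{
	struct event *e;

	/* the return value is already known, so nothing speculative here */
	if (ret >= 0)
		return 0;

	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	e->pid = bpf_get_current_pid_tgid() >> 32;
	e->err = ret;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Because the return value is available up front, the program only reserves
space when it already knows it wants the data, so the reserve/submit pairing
the verifier enforces is trivially satisfied and no per-task map is needed.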