Re: [PATCH bpf-next] bpf: ringbuf: Support consuming BPF_MAP_TYPE_RINGBUF from prog

Daniel Xu <dxu@xxxxxxxxx> · Tue, 10 Sep 2024 21:31:43 -0600

On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
> On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> >
> > On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
> > > On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Sep 10, 2024, at 2:07 PM, Daniel Xu wrote:
> > > > > On Tue, Sep 10, 2024 at 01:41:41PM GMT, Andrii Nakryiko wrote:
> > > > >> On Tue, Sep 10, 2024 at 11:36 AM Alexei Starovoitov
> > > > [...]
> > > > >
> > > > >>
> > > > >> Also, Daniel, can you please make sure that dynptr we return for each
> > > > >> sample is read-only? We shouldn't let consumer BPF program ability to
> > > > >> corrupt ringbuf record headers (accidentally or otherwise).
> > > > >
> > > > > Sure.
> > > >
> > > > So the sample is not read-only. But I think prog is prevented from messing
> > > > with header regardless.
> > > >
> > > > __bpf_user_ringbuf_peek() returns sample past the header:
> > > >
> > > >         *sample = (void *)((uintptr_t)rb->data +
> > > >                            (uintptr_t)((cons_pos + BPF_RINGBUF_HDR_SZ) & rb->mask));
> > > >
> > > > dynptr is initialized with the above ptr:
> > > >
> > > >         bpf_dynptr_init(&dynptr, sample, BPF_DYNPTR_TYPE_LOCAL, 0, size);
> > > >
> > > > So I don't think there's a way for the prog to access the header thru the dynptr.
> > > >
> > >
> > > By "header" I mean 8 bytes that precede each submitted ringbuf record.
> > > That header is part of ringbuf data area. Given user space can set
> > > consumer_pos to arbitrary value, kernel can return arbitrary part of
> > > ringbuf data area, including that 8 byte header. If that data is
> > > writable, it's easy to screw up that header and crash another BPF
> > > program that reserves/submits a new record. User space can only read
> > > data area for BPF ringbuf, and so we rely heavily on a tight control
> > > of who can write what into those 8 bytes.
> >
> > Ah, ok. I think I understand.
> >
> > Given this and your other comments about rb->busy, what about enforcing
> > bpf_user_ringbuf_drain() NAND mmap? I think the use cases here are
> > different enough where this makes sense.
> 
> You mean disabling user-space mmap()? TBH, I'd like to understand the
> use case first before we make such decisions. Maybe what you need is
> not really a BPF ringbuf? Can you give us a bit more details on what
> you are trying to achieve?

BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
entry in the cpumap. When a prog redirects to an entry in the cpumap,
the machinery queues up the xdp frame onto the destination CPU ptr_ring.
This can occur on any cpu, thus multi-producer. On processing side,
there is only the kthread created by the cpumap entry and bound to the
specific cpu that is consuming entries. So single consumer.

Goal is to track the latency overhead added from ptr_ring and the
kthread (versus softirq where is less overhead). Ideally we want p50,
p90, p95, p99 percentiles.

To do this, we need to track every single entry enqueue time as well as
dequeue time - events that occur in the tail are quite important.

Since ptr_ring is also a ring buffer, I thought it would be easy,
reliable, and fast to just create a "shadow" ring buffer. Every time
producer enqueues entries, I'd enqueue the same number of current
timestamp onto shadow RB. Same thing on consumer side, except dequeue
and calculate timestamp delta.

I was originally planning on writing my own lockless ring buffer in pure
BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
could avoid that with this patch.

About disabling user-space mmap: yeah, that's what I was suggesting. I
think it'd be a bit odd if you wanted BPF RB to support consumption from
both userspace && prog at the same time. And since draining from prog is
new functionality (and thus the NAND isn't a regression), you could
relax the restriction later without issues.