Re: [PATCH bpf-next] bpf: ringbuf: Support consuming BPF_MAP_TYPE_RINGBUF from prog

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 11 Sep 2024 13:04:48 -0700

On Tue, Sep 10, 2024 at 8:31 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
>
> On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
> > On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > >
> > > On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
> > > > On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 10, 2024, at 2:07 PM, Daniel Xu wrote:
> > > > > > On Tue, Sep 10, 2024 at 01:41:41PM GMT, Andrii Nakryiko wrote:
> > > > > >> On Tue, Sep 10, 2024 at 11:36 AM Alexei Starovoitov
> > > > > [...]
> > > > > >
> > > > > >>
> > > > > >> Also, Daniel, can you please make sure that dynptr we return for each
> > > > > >> sample is read-only? We shouldn't let consumer BPF program ability to
> > > > > >> corrupt ringbuf record headers (accidentally or otherwise).
> > > > > >
> > > > > > Sure.
> > > > >
> > > > > So the sample is not read-only. But I think prog is prevented from messing
> > > > > with header regardless.
> > > > >
> > > > > __bpf_user_ringbuf_peek() returns sample past the header:
> > > > >
> > > > >         *sample = (void *)((uintptr_t)rb->data +
> > > > >                            (uintptr_t)((cons_pos + BPF_RINGBUF_HDR_SZ) & rb->mask));
> > > > >
> > > > > dynptr is initialized with the above ptr:
> > > > >
> > > > >         bpf_dynptr_init(&dynptr, sample, BPF_DYNPTR_TYPE_LOCAL, 0, size);
> > > > >
> > > > > So I don't think there's a way for the prog to access the header thru the dynptr.
> > > > >
> > > >
> > > > By "header" I mean 8 bytes that precede each submitted ringbuf record.
> > > > That header is part of ringbuf data area. Given user space can set
> > > > consumer_pos to arbitrary value, kernel can return arbitrary part of
> > > > ringbuf data area, including that 8 byte header. If that data is
> > > > writable, it's easy to screw up that header and crash another BPF
> > > > program that reserves/submits a new record. User space can only read
> > > > data area for BPF ringbuf, and so we rely heavily on a tight control
> > > > of who can write what into those 8 bytes.
> > >
> > > Ah, ok. I think I understand.
> > >
> > > Given this and your other comments about rb->busy, what about enforcing
> > > bpf_user_ringbuf_drain() NAND mmap? I think the use cases here are
> > > different enough where this makes sense.
> >
> > You mean disabling user-space mmap()? TBH, I'd like to understand the
> > use case first before we make such decisions. Maybe what you need is
> > not really a BPF ringbuf? Can you give us a bit more details on what
> > you are trying to achieve?
>
> BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
> entry in the cpumap. When a prog redirects to an entry in the cpumap,
> the machinery queues up the xdp frame onto the destination CPU ptr_ring.
> This can occur on any cpu, thus multi-producer. On processing side,
> there is only the kthread created by the cpumap entry and bound to the
> specific cpu that is consuming entries. So single consumer.
>
> Goal is to track the latency overhead added from ptr_ring and the
> kthread (versus softirq where is less overhead). Ideally we want p50,
> p90, p95, p99 percentiles.
>
> To do this, we need to track every single entry enqueue time as well as
> dequeue time - events that occur in the tail are quite important.
>
> Since ptr_ring is also a ring buffer, I thought it would be easy,
> reliable, and fast to just create a "shadow" ring buffer. Every time
> producer enqueues entries, I'd enqueue the same number of current
> timestamp onto shadow RB. Same thing on consumer side, except dequeue
> and calculate timestamp delta.
>
> I was originally planning on writing my own lockless ring buffer in pure
> BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
> could avoid that with this patch.

TBH, I'd just do a fixed-sized array and use atomic counters to
enqueue and consumer position to dequeue. But seems like you might get
away without any circular buffer, based on Jesper's suggestion.

>
> About disabling user-space mmap: yeah, that's what I was suggesting. I
> think it'd be a bit odd if you wanted BPF RB to support consumption from
> both userspace && prog at the same time. And since draining from prog is
> new functionality (and thus the NAND isn't a regression), you could
> relax the restriction later without issues.

I probably wouldn't disable mmap-ing from user space and let the user
handle coordination of consumers, if necessary (we could probably
expose the "busy" field through the consumer metadata page, if it's
not yet, to help with that a bit). This is more flexible as it allows
to alternate who's consuming, if necessary. Consumer side can't
compromise kernel-side producers (if we prevent writes from the
consumer side).

Still, all this feels a bit kludgy anyways. There is also unnecessary
epoll notification, which will be happening by default unless consumer
explicitly requests to not do notification.

Anyways, if there is a better way for your task, see if that works first.