Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 14 May 2020 13:18:50 -0700

On Thu, May 14, 2020 at 10:33 AM <sdf@xxxxxxxxxx> wrote:
>
> On 05/13, Andrii Nakryiko wrote:
> > This commits adds a new MPSC ring buffer implementation into BPF
> > ecosystem,
> > which allows multiple CPUs to submit data to a single shared ring buffer.
> > On
> > the consumption side, only single consumer is assumed.
>

[...]

> > Comparison to alternatives
> > --------------------------
> > Before considering implementing BPF ring buffer from scratch existing
> > alternatives in kernel were evaluated, but didn't seem to meet the needs.
> > They
> > largely fell into few categores:
> >    - per-CPU buffers (perf, ftrace, etc), which don't satisfy two
> > motivations
> >      outlined above (ordering and memory consumption);
> >    - linked list-based implementations; while some were multi-producer
> > designs,
> >      consuming these from user-space would be very complicated and most
> >      probably not performant; memory-mapping contiguous piece of memory is
> >      simpler and more performant for user-space consumers;
> >    - io_uring is SPSC, but also requires fixed-sized elements. Naively
> > turning
> >      SPSC queue into MPSC w/ lock would have subpar performance compared to
> >      locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
> >      elements would be too limiting for BPF programs, given existing BPF
> >      programs heavily rely on variable-sized perf buffer already;
> >    - specialized implementations (like a new printk ring buffer, [0]) with
> > lots
> >      of printk-specific limitations and implications, that didn't seem to
> > fit
> >      well for intended use with BPF programs.
> That's a very nice write up! Does it make sense to put most of it
> under Documentation/bpf? We were discussing socket storage with KP
> recently and I mentioned that commit 6ac99e8f23d4 has a really nice
> description of the architecture with ascii diagrams/etc. Sometimes
> it's really hard to chase down the commit history to find out
> these sorts of details.

Sure, can do that. And thanks :)

>
> >    [0] https://lwn.net/Articles/779550/
>
> > Signed-off-by: Andrii Nakryiko <andriin@xxxxxx>
> > ---
> >   include/linux/bpf.h            |  12 +
> >   include/linux/bpf_types.h      |   1 +
> >   include/linux/bpf_verifier.h   |   4 +
> >   include/uapi/linux/bpf.h       |  33 ++-
> >   kernel/bpf/Makefile            |   2 +-
> >   kernel/bpf/helpers.c           |   8 +
> >   kernel/bpf/ringbuf.c           | 409 +++++++++++++++++++++++++++++++++
> >   kernel/bpf/syscall.c           |  12 +
> >   kernel/bpf/verifier.c          | 156 ++++++++++---
> >   kernel/trace/bpf_trace.c       |   8 +
> >   tools/include/uapi/linux/bpf.h |  33 ++-
> >   11 files changed, 643 insertions(+), 35 deletions(-)
> >   create mode 100644 kernel/bpf/ringbuf.c
>

[...]

> > + * void bpf_ringbuf_submit(void *data)
> > + *   Description
> > + *           Submit reserved ring buffer sample, pointed to by *data*.
> > + *   Return
> > + *           Nothing.
> Even though you mention self-pacing properties, would it still
> make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
> to indicate whether to wake up userspace or not? Maybe something like
> a threshold of number of outstanding events in the ringbuf after which
> we do the wakeup? The default 0/1 preserve the existing behavior.
>
> The example I can give is a control plane userspace thread that
> once a second aggregates the events, it doesn't care about millisecond
> resolution. With the current scheme, I suppose, if BPF generates events
> every 1ms, the userspace will be woken up 1000 times (if it can keep
> up). Most of the time, we don't really care and some buffering
> properties are desired.

perf buffer has setting like this, and believe me, it's so confusing
and dangerous, that I wouldn't want this to be exposed. Even though I
was aware of this behavior, I still had to debug and work-around this
lack on wakeup few times, it's really-really confusing feature.

In your case, though, why wouldn't user-space poll data just once a
second, if it's not interested in getting data as fast as possible?

[...]

> > +struct bpf_ringbuf {
> > +     wait_queue_head_t waitq;
> > +     u64 mask;
> > +     spinlock_t spinlock ____cacheline_aligned_in_smp;
> > +     u64 consumer_pos __aligned(PAGE_SIZE);
> > +     u64 producer_pos __aligned(PAGE_SIZE);
> > +     char data[] __aligned(PAGE_SIZE);
> > +};
> > +
> > +struct bpf_ringbuf_map {
> > +     struct bpf_map map;
> > +     struct bpf_map_memory memory;
> > +     struct bpf_ringbuf *rb;
> > +};
> > +
> > +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int
> > numa_node)
> > +{
> > +     const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
> > +                         __GFP_ZERO;
> > +     int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;
> There is a bunch of magic '2's scattered around. Would it make sense
> to use a proper define (with a comment) instead?

Yep, it's consumer + producer counter pages, I'll add a constant.

>
> > +     int nr_data_pages = data_sz >> PAGE_SHIFT;
> > +     int nr_pages = nr_meta_pages + nr_data_pages;
> > +     struct page **pages, *page;
> > +     size_t array_size;
> > +     void *addr;
> > +     int i;
> > +

[...]

Please trim. I do love my code of course, but scrolling through it so
many times gets old still ;)