On Thu, May 14, 2020 at 10:33 AM <sdf@xxxxxxxxxx> wrote: > > On 05/13, Andrii Nakryiko wrote: > > This commits adds a new MPSC ring buffer implementation into BPF > > ecosystem, > > which allows multiple CPUs to submit data to a single shared ring buffer. > > On > > the consumption side, only single consumer is assumed. > [...] > > Comparison to alternatives > > -------------------------- > > Before considering implementing BPF ring buffer from scratch existing > > alternatives in kernel were evaluated, but didn't seem to meet the needs. > > They > > largely fell into few categores: > > - per-CPU buffers (perf, ftrace, etc), which don't satisfy two > > motivations > > outlined above (ordering and memory consumption); > > - linked list-based implementations; while some were multi-producer > > designs, > > consuming these from user-space would be very complicated and most > > probably not performant; memory-mapping contiguous piece of memory is > > simpler and more performant for user-space consumers; > > - io_uring is SPSC, but also requires fixed-sized elements. Naively > > turning > > SPSC queue into MPSC w/ lock would have subpar performance compared to > > locked reserve + lockless commit, as with BPF ring buffer. Fixed sized > > elements would be too limiting for BPF programs, given existing BPF > > programs heavily rely on variable-sized perf buffer already; > > - specialized implementations (like a new printk ring buffer, [0]) with > > lots > > of printk-specific limitations and implications, that didn't seem to > > fit > > well for intended use with BPF programs. > That's a very nice write up! Does it make sense to put most of it > under Documentation/bpf? We were discussing socket storage with KP > recently and I mentioned that commit 6ac99e8f23d4 has a really nice > description of the architecture with ascii diagrams/etc. Sometimes > it's really hard to chase down the commit history to find out > these sorts of details. Sure, can do that. And thanks :) > > > [0] https://lwn.net/Articles/779550/ > > > Signed-off-by: Andrii Nakryiko <andriin@xxxxxx> > > --- > > include/linux/bpf.h | 12 + > > include/linux/bpf_types.h | 1 + > > include/linux/bpf_verifier.h | 4 + > > include/uapi/linux/bpf.h | 33 ++- > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/helpers.c | 8 + > > kernel/bpf/ringbuf.c | 409 +++++++++++++++++++++++++++++++++ > > kernel/bpf/syscall.c | 12 + > > kernel/bpf/verifier.c | 156 ++++++++++--- > > kernel/trace/bpf_trace.c | 8 + > > tools/include/uapi/linux/bpf.h | 33 ++- > > 11 files changed, 643 insertions(+), 35 deletions(-) > > create mode 100644 kernel/bpf/ringbuf.c > [...] > > + * void bpf_ringbuf_submit(void *data) > > + * Description > > + * Submit reserved ring buffer sample, pointed to by *data*. > > + * Return > > + * Nothing. > Even though you mention self-pacing properties, would it still > make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output > to indicate whether to wake up userspace or not? Maybe something like > a threshold of number of outstanding events in the ringbuf after which > we do the wakeup? The default 0/1 preserve the existing behavior. > > The example I can give is a control plane userspace thread that > once a second aggregates the events, it doesn't care about millisecond > resolution. With the current scheme, I suppose, if BPF generates events > every 1ms, the userspace will be woken up 1000 times (if it can keep > up). Most of the time, we don't really care and some buffering > properties are desired. perf buffer has setting like this, and believe me, it's so confusing and dangerous, that I wouldn't want this to be exposed. Even though I was aware of this behavior, I still had to debug and work-around this lack on wakeup few times, it's really-really confusing feature. In your case, though, why wouldn't user-space poll data just once a second, if it's not interested in getting data as fast as possible? [...] > > +struct bpf_ringbuf { > > + wait_queue_head_t waitq; > > + u64 mask; > > + spinlock_t spinlock ____cacheline_aligned_in_smp; > > + u64 consumer_pos __aligned(PAGE_SIZE); > > + u64 producer_pos __aligned(PAGE_SIZE); > > + char data[] __aligned(PAGE_SIZE); > > +}; > > + > > +struct bpf_ringbuf_map { > > + struct bpf_map map; > > + struct bpf_map_memory memory; > > + struct bpf_ringbuf *rb; > > +}; > > + > > +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int > > numa_node) > > +{ > > + const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | > > + __GFP_ZERO; > > + int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF; > There is a bunch of magic '2's scattered around. Would it make sense > to use a proper define (with a comment) instead? Yep, it's consumer + producer counter pages, I'll add a constant. > > > + int nr_data_pages = data_sz >> PAGE_SHIFT; > > + int nr_pages = nr_meta_pages + nr_data_pages; > > + struct page **pages, *page; > > + size_t array_size; > > + void *addr; > > + int i; > > + [...] Please trim. I do love my code of course, but scrolling through it so many times gets old still ;)