On Thu, May 21, 2020 at 6:21 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sun, May 17, 2020 at 12:57:26PM -0700, Andrii Nakryiko wrote:
> > +
> > +static inline int roundup_len(__u32 len)
> > +{
> > +	/* clear out top 2 bits */
> > +	len <<= 2;
> > +	len >>= 2;
> > +	/* add length prefix */
> > +	len += RINGBUF_META_LEN;
> > +	/* round up to 8 byte alignment */
> > +	return (len + 7) / 8 * 8;
> > +}
>
> the same round_up again?
>
> > +
> > +static void ringbuf_custom_process_ring(struct ringbuf_custom *r)
> > +{
> > +	unsigned long cons_pos, prod_pos;
> > +	int *len_ptr, len;
> > +	bool got_new_data;
> > +
> > +	cons_pos = smp_load_acquire(r->consumer_pos);
> > +	while (true) {
> > +		got_new_data = false;
> > +		prod_pos = smp_load_acquire(r->producer_pos);
> > +		while (cons_pos < prod_pos) {
> > +			len_ptr = r->data + (cons_pos & r->mask);
> > +			len = smp_load_acquire(len_ptr);
> > +
> > +			/* sample not committed yet, bail out for now */
> > +			if (len & RINGBUF_BUSY_BIT)
> > +				return;
> > +
> > +			got_new_data = true;
> > +			cons_pos += roundup_len(len);
> > +
> > +			atomic_inc(&buf_hits.value);
> > +		}
> > +		if (got_new_data)
> > +			smp_store_release(r->consumer_pos, cons_pos);
> > +		else
> > +			break;
> > +	};
> > +}
>
> copy paste from libbpf? why?

Yep, it's deliberate, see the description of the rb-custom benchmark in
the commit message. Basically, I was worried that libbpf's generic
callback invocation might introduce a noticeable slowdown, so I wanted
to test the case where ultimate performance is necessary and someone
would just implement a custom reading loop with inlined processing
logic. It did turn out to be noticeable, see the benchmark results for
the rb-libbpf and rb-custom benchmarks.

So I satisfied my curiosity :), but if you think that's not necessary,
I can drop rb-custom (and, similarly, pb-custom for perfbuf). It still
seems useful to me, though (and it also serves as an example of a
minimal correct implementation of ringbuf/perfbuf reading).

Btw, apart from the speedup from avoiding indirect calls (due to the
callback), this algorithm also does a "batched"
smp_store_release(r->consumer_pos) only once after each inner
`while (cons_pos < prod_pos)` loop, so there is theoretically less
cache line bouncing, which might give a bit more speed as well. Libbpf
pessimistically does smp_store_release() after each record, on the
assumption that the user-provided callback might take a while, so it's
better to give producers more space back as early as possible.
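
To make the contrast concrete, here is a rough sketch of the per-record
update pattern described above. This is not libbpf's actual code, just
an illustration; it reuses struct ringbuf_custom, roundup_len(), and the
RINGBUF_* constants from the quoted rb-custom code, and process_record()
is a hypothetical stand-in for the user-provided callback:

/* Illustrative sketch only, not libbpf's real implementation.
 * Hypothetical user callback: void process_record(void *sample, __u32 len);
 */
static void ringbuf_process_ring_per_record(struct ringbuf_custom *r)
{
	unsigned long cons_pos, prod_pos;
	int *len_ptr, len;
	__u32 sample_len;

	cons_pos = smp_load_acquire(r->consumer_pos);
	prod_pos = smp_load_acquire(r->producer_pos);
	while (cons_pos < prod_pos) {
		len_ptr = r->data + (cons_pos & r->mask);
		len = smp_load_acquire(len_ptr);

		/* sample not committed yet, bail out for now */
		if (len & RINGBUF_BUSY_BIT)
			return;

		/* clear out flag bits, same as roundup_len() does */
		sample_len = (__u32)len << 2 >> 2;
		/* hand the sample to the (potentially slow) user callback */
		process_record(r->data + (cons_pos & r->mask) + RINGBUF_META_LEN,
			       sample_len);

		cons_pos += roundup_len(len);
		/* consumer_pos is released after *every* record: more cache
		 * line bouncing between producer and consumer, but producers
		 * get their space back as early as possible
		 */
		smp_store_release(r->consumer_pos, cons_pos);
	}
}

rb-custom above trades the second property for the first one by moving
the store_release out of the inner loop.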