On Thu, May 21, 2020 at 6:21 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sun, May 17, 2020 at 12:57:26PM -0700, Andrii Nakryiko wrote:
> > +
> > +static inline int roundup_len(__u32 len)
> > +{
> > +	/* clear out top 2 bits */
> > +	len <<= 2;
> > +	len >>= 2;
> > +	/* add length prefix */
> > +	len += RINGBUF_META_LEN;
> > +	/* round up to 8 byte alignment */
> > +	return (len + 7) / 8 * 8;
> > +}
>
> the same round_up again?
>
> > +
> > +static void ringbuf_custom_process_ring(struct ringbuf_custom *r)
> > +{
> > +	unsigned long cons_pos, prod_pos;
> > +	int *len_ptr, len;
> > +	bool got_new_data;
> > +
> > +	cons_pos = smp_load_acquire(r->consumer_pos);
> > +	while (true) {
> > +		got_new_data = false;
> > +		prod_pos = smp_load_acquire(r->producer_pos);
> > +		while (cons_pos < prod_pos) {
> > +			len_ptr = r->data + (cons_pos & r->mask);
> > +			len = smp_load_acquire(len_ptr);
> > +
> > +			/* sample not committed yet, bail out for now */
> > +			if (len & RINGBUF_BUSY_BIT)
> > +				return;
> > +
> > +			got_new_data = true;
> > +			cons_pos += roundup_len(len);
> > +
> > +			atomic_inc(&buf_hits.value);
> > +		}
> > +		if (got_new_data)
> > +			smp_store_release(r->consumer_pos, cons_pos);
> > +		else
> > +			break;
> > +	};
> > +}
>
> copy paste from libbpf? why?

Yep, it's deliberate, see the description of the rb-custom benchmark in
the commit message. Basically, I was worried that libbpf's generic
callback invocation might introduce a noticeable slowdown, so I wanted
to test the case where ultimate performance is necessary and someone
would just implement a custom reading loop with inlined processing
logic. It did turn out to be noticeable, see the benchmark results for
the rb-libbpf and rb-custom benchmarks.

So I satisfied my curiosity :), but if you think that's not necessary,
I can drop rb-custom (and, similarly, pb-custom for perfbuf). It still
seems useful to me, though (and it also serves as an example of a
minimal correct implementation of ringbuf/perfbuf reading).

Btw, apart from the speedup from avoiding indirect calls (due to the
callback), this algorithm also does a "batched"
smp_store_release(r->consumer_pos) only once after each inner
`while (cons_pos < prod_pos)` loop, so there is theoretically less
cache line bouncing, which might give a bit more speed as well. Libbpf
pessimistically does smp_store_release() after each record, on the
assumption that the user-provided callback might take a while, so it's
better to give producers more space back as early as possible.
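
To make the contrast concrete, here is a rough sketch of the per-record
update pattern described above. This is not libbpf's actual code, just
an illustration; it reuses struct ringbuf_custom, roundup_len(), and the
RINGBUF_* constants from the quoted rb-custom code, and process_record()
is a hypothetical stand-in for the user-provided callback:

/* Illustrative sketch only, not libbpf's real implementation.
 * Hypothetical user callback: void process_record(void *sample, __u32 len);
 */
static void ringbuf_process_ring_per_record(struct ringbuf_custom *r)
{
	unsigned long cons_pos, prod_pos;
	int *len_ptr, len;
	__u32 sample_len;

	cons_pos = smp_load_acquire(r->consumer_pos);
	prod_pos = smp_load_acquire(r->producer_pos);
	while (cons_pos < prod_pos) {
		len_ptr = r->data + (cons_pos & r->mask);
		len = smp_load_acquire(len_ptr);

		/* sample not committed yet, bail out for now */
		if (len & RINGBUF_BUSY_BIT)
			return;

		/* clear out flag bits, same as roundup_len() does */
		sample_len = (__u32)len << 2 >> 2;
		/* hand the sample to the (potentially slow) user callback */
		process_record(r->data + (cons_pos & r->mask) + RINGBUF_META_LEN,
			       sample_len);

		cons_pos += roundup_len(len);
		/* consumer_pos is released after *every* record: more cache
		 * line bouncing between producer and consumer, but producers
		 * get their space back as early as possible
		 */
		smp_store_release(r->consumer_pos, cons_pos);
	}
}

rb-custom above trades the second property for the first one by moving
the store_release out of the inner loop.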