Re: Maximum size of record over perf ring buffer?

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Sun, 19 Jul 2020 21:35:19 -0700

On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
<Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
>
> Hello
>
> I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.
>
> For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
>
> Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.
>

I'm not aware of any such limitation for the perf ring buffer, but I
haven't had a chance to validate this. It would be great if you could
provide a small repro so that someone can take a deeper look. If you
really get clobbered data, it does sound like a bug. It might also be
down to how you set up perfbuf: AFAIK, it has a mode where it will
overwrite data that isn't consumed quickly enough, but you need to
consciously enable that mode.

But apart from that, shameless plug here, you can try the new BPF ring
buffer ([0]), available in 5.8+ kernels. It will allow you to avoid
the extra copy of data you get with bpf_perf_event_output(), if you
use BPF ringbuf's bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It
also has a bpf_ringbuf_output() API, which is logically equivalent to
bpf_perf_event_output(). And it has a very high limit on sample size,
up to 512MB per sample.
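
To make the "extra copy" point concrete, here is a toy userspace model
of the two styles. It is only a sketch and not how the kernel ringbuf
is implemented: the toy_* names are made up for illustration, there is
no wrap-around or concurrency handling, and it assumes one reservation
in flight at a time. The "copies" counter exists only to show where
the copy happens.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_RING_SIZE 4096u /* arbitrary toy capacity */

struct toy_ring {
    uint8_t data[TOY_RING_SIZE];
    size_t reserved;  /* bytes handed out to the producer so far */
    size_t committed; /* bytes the consumer is allowed to read */
    size_t copies;    /* number of memcpy()s into the ring (illustration only) */
};

/* Reserve len contiguous bytes inside the ring; NULL if it does not fit. */
static void *toy_reserve(struct toy_ring *r, size_t len)
{
    if (len > TOY_RING_SIZE - r->reserved)
        return NULL;
    void *p = &r->data[r->reserved];
    r->reserved += len;
    return p;
}

/* Publish everything reserved so far (single outstanding reservation). */
static void toy_submit(struct toy_ring *r)
{
    r->committed = r->reserved;
}

/* Output style: the sample was built elsewhere, so it must be copied in. */
static int toy_output(struct toy_ring *r, const void *sample, size_t len)
{
    void *p = toy_reserve(r, len);
    if (!p)
        return -1;
    memcpy(p, sample, len); /* this is the extra copy */
    r->copies++;
    toy_submit(r);
    return 0;
}
```

With toy_reserve() + toy_submit() the producer fills the record in
place, so "copies" stays at 0; with toy_output() every record costs
one memcpy() on top of whatever it took to build the sample.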

Keep in mind, BPF ringbuf is an MPSC (multi-producer, single-consumer)
design, and if you use just one BPF ringbuf across all CPUs, you might
run into some contention across multiple CPUs. That is acceptable in a
lot of the applications I was targeting, but if you have a high
frequency of events (keep in mind, throughput doesn't matter, only
contention on sample reservation matters), you might want to use an
array of BPF ringbufs to scale throughput. You can do 1 ringbuf per
CPU for ultimate performance at the expense of memory usage (that's
the perf ring buffer setup), but BPF ringbuf is flexible enough to
allow any topology that makes sense for your use case, from 1 shared
ringbuf across all CPUs, to 1 per CPU, to anything in between.
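
As a rough sketch of what "any topology" means in practice, the
choice boils down to how a CPU id maps to a ringbuf index. The helper
below is hypothetical (not a libbpf or kernel function) and is written
as plain C; in a BPF program the CPU id would typically come from
bpf_get_smp_processor_id(), and the ringbufs themselves could live in
something like a BPF_MAP_TYPE_ARRAY_OF_MAPS, which is an assumption
about your setup rather than a requirement.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick which ringbuf a given CPU should write to.
 *   nr_ringbufs == 1        -> one ringbuf shared by all CPUs
 *   nr_ringbufs == nr_cpus  -> one ringbuf per CPU
 *   anything in between     -> CPUs are spread round-robin over the ringbufs
 */
static inline uint32_t ringbuf_index_for_cpu(uint32_t cpu, uint32_t nr_ringbufs)
{
    if (nr_ringbufs == 0)
        return 0; /* defensive: avoid dividing by zero */
    return cpu % nr_ringbufs;
}
```

For example, with 4 ringbufs on a 16-CPU box, CPUs 0, 4, 8 and 12
share ringbuf 0, so at most 4 CPUs ever contend on the same
reservation path.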

  [0] https://patchwork.ozlabs.org/project/netdev/list/?series=180119&state=*

> Thank you in anticipation
>
> Kevin Sheldrake
>