On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >
> > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > Hello!
> > >
> > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > map. Looking at the kernel implementation [0] it seems that data pages
> > > are mapped 2 times to have a more efficient and simpler
> > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > buffer didn't have such an implementation. In the Falco project [1] we
> > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > system throws and the default size of each buffer is 8 MB. This means
> > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> >
> > hum IIUC it's not allocated twice but pages are just mapped twice,
> > to cope with wrap around samples, described in git log:
> >
> > One interesting implementation bit, that significantly simplifies (and thus
> > speeds up as well) implementation of both producers and consumers is how data
> > area is mapped twice contiguously back-to-back in the virtual memory. This
> > allows to not take any special measures for samples that have to wrap around
> > at the end of the circular buffer data area, because the next page after the
> > last data page would be first data page again, and thus the sample will still
> > appear completely contiguous in virtual memory. See comment and a simple ASCII
> > diagram showing this visually in bpf_ringbuf_area_alloc().
>
> yes, exactly, there is no duplication of memory, it's just mapped
> twice to make working with records that wrap around simple and
> efficient
>

Thank you very much for the quick response! My previous question was
quite unclear, sorry for that; I will try to explain myself better with
some data. I've collected in this document [3] some thoughts on two
simple examples, one with the perf buffer and one with the ring buffer.
Without going into too many details, I noticed a strange "Resident set
size" (RSS) value in the ring buffer example. It is probably perfectly
fine, but I don't understand why the RSS of each ring buffer matches its
virtual memory size, and I'm wondering whether this could affect the
OOM-score computation, making a program that uses ring buffers more
vulnerable to the OOM killer.

[3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
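To make the RSS question more concrete, below is the kind of minimal
reproducer I have in mind (just a sketch, not the code behind the
numbers in [3]; the 8 MiB size is simply our default, it needs
root/CAP_BPF and a recent libbpf). It creates a BPF_MAP_TYPE_RINGBUF
map, attaches a libbpf consumer to it, and prints VmRSS from
/proc/self/status at each step:

/* rss_check.c - build with: gcc rss_check.c -lbpf */
#include <stdio.h>
#include <string.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static void print_vmrss(const char *when)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmRSS:", 6))
			printf("%-25s %s", when, line);
	fclose(f);
}

static int handle_event(void *ctx, void *data, size_t len)
{
	return 0; /* we only care about the memory accounting here */
}

int main(void)
{
	struct ring_buffer *rb;
	int map_fd;

	print_vmrss("before map creation:");

	/* 8 MiB data area (must be a power of two, multiple of page size) */
	map_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, "rb", 0, 0,
				8 * 1024 * 1024, NULL);
	if (map_fd < 0)
		return 1;
	print_vmrss("after map creation:");

	/* ring_buffer__new() mmaps the consumer page and the data area */
	rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
	if (!rb)
		return 1;
	print_vmrss("after ring_buffer__new:");

	ring_buffer__free(rb);
	return 0;
}

IIUC the libbpf consumer maps the data area twice as well (one producer
page plus 2 * max_entries), so the doubled virtual size is expected;
what I don't fully understand is why the whole mapping shows up as
resident.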
> >
> > > issue is that this memory requirement could be too much for some
> > > systems and also in Kubernetes environments where there are strict
> > > resource limits... Our actual workaround is to use ring buffers shared
> > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > this solution has a price since we increase the contention on the ring
> > > buffers and as highlighted here [2], the presence of multiple
> > > competing writers on the same buffer could become a real bottleneck...
> > > Sorry for the long introduction, my question here is, are there any
> > > other approaches to manage such a scenario? Will there be a
> > > possibility to use the ring buffer without the kernel double mapping
> > > in the near future? The ring buffer has such amazing features with
> > > respect to the perf buffer, but in a scenario like the Falco one,
> > > where we have aggressive multiple producers, this double mapping could
> > > become a limitation.
> >
> > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > to have extra copy for each cpu if you don't really want to
> >
> seems like they do share, but only between CPUs. But nothing prevents
> you from sharing between more than 2 CPUs, right? It's a tradeoff
> between contention and overall memory usage (but as pointed out,
> ringbuf doesn't use 2x more memory).

Yes, exactly, we can and we will do that (a minimal sketch of what I
mean by the per-CPU-pair layout is at the end of this mail).

> Do you actually see a lot of contention when sharing ringbuf between
> 2 CPUs?

Actually no, I haven't seen much contention with this configuration; it
seems to handle the throughput quite well. That said, it is still an
experimental solution, so it has not been tested much against
real-world workloads.

> There are multiple applications that share a single ringbuf between
> all CPUs, and no one really complained about high contention so far.
> You'd need to push tons of data non-stop, probably, at which point I'd
> worry about consumers not being able to keep up (and definitely not
> doing much useful with all this data). But YMMV, of course.
>

We are a little worried about the single-ring-buffer scenario, mainly
when we have something like 64 CPUs and all syscalls enabled, but, as
you correctly highlighted, in that case we would also have some issues
on the userspace side, because we wouldn't be able to handle all that
traffic, causing tons of event drops. BTW, thank you for the feedback!

> > jirka
> >
> > > Thank you in advance for your time,
> > > Andrea
> > >
> > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > 1: https://github.com/falcosecurity/falco
> > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@xxxxxx/
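P.S. This is the minimal sketch of the per-CPU-pair layout I mentioned
above. It is illustrative only, not our actual code: the 8 MiB size,
the 64 outer slots, the raw_syscalls tracepoint and all names are
placeholders. The idea is an ARRAY_OF_MAPS whose inner maps are ring
buffers, indexed with cpu_id / 2 so that two CPUs share one buffer:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* outer map: one slot per CPU pair, populated from user space (see the
 * second sketch below); the anonymous struct only describes the inner
 * map type.
 */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 64);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__array(values, struct {
		__uint(type, BPF_MAP_TYPE_RINGBUF);
		__uint(max_entries, 8 * 1024 * 1024);
	});
} ringbuf_maps SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int handle_sys_enter(void *ctx)
{
	__u32 key = bpf_get_smp_processor_id() / 2; /* two CPUs per buffer */
	void *rb = bpf_map_lookup_elem(&ringbuf_maps, &key);
	__u64 *event;

	if (!rb)
		return 0;

	event = bpf_ringbuf_reserve(rb, sizeof(*event), 0);
	if (!event)
		return 0; /* buffer full, drop the event */

	*event = bpf_ktime_get_ns(); /* dummy payload */
	bpf_ringbuf_submit(event, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";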
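And the user-space counterpart (again only a sketch; error paths and
the skeleton/attach boilerplate are omitted): it creates one ring
buffer per CPU pair and plugs each one into the outer map before the
program starts emitting events.

#include <unistd.h>
#include <bpf/bpf.h>

/* outer_map_fd is the fd of the ringbuf_maps ARRAY_OF_MAPS above; the
 * inner map fds are returned in fds[] for the ring buffer consumer.
 */
int setup_pair_ringbufs(int outer_map_fd, __u32 buf_size,
			int *fds, __u32 max_bufs)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	__u32 nbufs = (ncpus + 1) / 2;	/* one buffer every two CPUs */

	if (nbufs > max_bufs)
		return -1;

	for (__u32 i = 0; i < nbufs; i++) {
		fds[i] = bpf_map_create(BPF_MAP_TYPE_RINGBUF, NULL, 0, 0,
					buf_size, NULL);
		if (fds[i] < 0)
			return fds[i];
		/* store the inner map fd in slot i of the outer map */
		int err = bpf_map_update_elem(outer_map_fd, &i, &fds[i],
					      BPF_ANY);
		if (err)
			return err;
	}
	return nbufs;
}

The consumer can then poll all of the buffers from a single struct
ring_buffer by calling ring_buffer__new() on the first fd and
ring_buffer__add() for the others.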