On Wed, Feb 15, 2023 at 2:35 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sun, Feb 5, 2023 at 7:28 AM andrea terzolo <andreaterzolo3@xxxxxxxxx> wrote:
> >
> > On Fri, Jan 27, 2023 at 7:54 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Sun, Jan 15, 2023 at 9:10 AM andrea terzolo <andreaterzolo3@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > > > > > Hello!
> > > > > > >
> > > > > > > If I may, I would like to ask a question about the BPF_MAP_TYPE_RINGBUF map. Looking at the kernel implementation [0], it seems that the data pages are mapped twice to get a simpler and more efficient implementation. This seems to be a ring buffer peculiarity; the perf buffer didn't have such an implementation. In the Falco project [1] we use huge per-CPU buffers to collect almost all the syscalls issued by the system, and the default size of each buffer is 8 MB. This means that, using the ring buffer approach on a system with 128 CPUs, we will have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> > > > > >
> > > > > > Hum, IIUC it's not allocated twice, the pages are just mapped twice to cope with wrap-around samples, as described in the git log:
> > > > > >
> > > > > >   One interesting implementation bit, that significantly simplifies (and thus speeds up as well) implementation of both producers and consumers is how data area is mapped twice contiguously back-to-back in the virtual memory. This allows to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. See comment and a simple ASCII diagram showing this visually in bpf_ringbuf_area_alloc().
> > > > >
> > > > > Yes, exactly, there is no duplication of memory; it's just mapped twice to make working with records that wrap around simple and efficient.
> > > >
> > > > Thank you very much for the quick response. My previous question was quite unclear, sorry for that; I will try to explain myself better with some data. I've collected in this document [3] some thoughts about two simple examples with the perf buffer and the ring buffer. Without going into too many details, I've noticed a strange "Resident set size" (RSS) value in the ring buffer example. It is probably perfectly fine, but I really don't understand why the RSS of each ring buffer takes the same value as its virtual memory size, and I'm wondering whether this could affect the OOM score computation, making a program that uses ring buffers more vulnerable to the OOM killer.
> > > >
> > > > [3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
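
For reference, a minimal C sketch of how the RSS vs. virtual-memory-size numbers discussed above might be observed from inside the consumer process. It only reads /proc/self/status; the helper name and its tag argument are illustrative, not part of any existing tool:

    /* Print VmSize (total mapped virtual memory) and VmRSS (resident
     * pages) of the current process, e.g. right after the ring buffer
     * maps have been created and mmap'ed by the consumer. */
    #include <stdio.h>
    #include <string.h>

    static void print_mem_usage(const char *tag)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
            return;
        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
                printf("[%s] %s", tag, line);
        }
        fclose(f);
    }

Calling it before and after setting up the buffers shows the per-buffer contribution to both counters.
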
> > > I'm not an mm expert, unfortunately. Perhaps because we have twice as many pages mapped (even though they are using only 8 MB of physical memory), it is treated as if the process' RSS usage is 2x of that. I can see how that might be a concern for the OOM score, but I'm not sure what can be done about this...
> >
> > Yes, this is weird behavior. Unfortunately, a process that uses a ring buffer for each CPU is penalized from this point of view with respect to one that uses a perf buffer. Do you by chance know someone who could help us with this strange memory accounting?
>
> So I checked with an MM expert, and he confirmed that currently there is no way to avoid this double-accounting of the memory reserved by the BPF ringbuf. But this doesn't seem to be a problem unique to the BPF ringbuf; RSS accounting is generally known to have problems with double-counting memory in some situations.

Thank you for reporting this and for all the help in this thread, really appreciated!

> One relatively clean suggested way to solve this problem would be to add a new memory counter (in addition to the existing MM_SHMEMPAGES, MM_SWAPENTS, MM_ANONPAGES, MM_FILEPAGES) to compensate for cases like this.
>
> But it does look like a pretty big overkill here, tbh. Sorry, I don't have a good solution for you here.
>
> > > > > > > issue is that this memory requirement could be too much for some systems, and also in Kubernetes environments where there are strict resource limits... Our current workaround is to use ring buffers shared between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for example we allocate a ring buffer for each CPU pair. Unfortunately, this solution has a price, since we increase the contention on the ring buffers and, as highlighted here [2], the presence of multiple competing writers on the same buffer could become a real bottleneck... Sorry for the long introduction; my question is: are there any other approaches to manage such a scenario? Will there be a possibility to use the ring buffer without the kernel double mapping in the near future? The ring buffer has such amazing features compared to the perf buffer, but in a scenario like the Falco one, where we have aggressive multiple producers, this double mapping could become a limitation.
> > > > > >
> > > > > > AFAIK the BPF ring buffer can be used across CPUs, so you don't need to have an extra copy for each CPU if you don't really want to.
> > > > >
> > > > > Seems like they do share, but only between 2 CPUs. But nothing prevents you from sharing between more than 2 CPUs, right? It's a tradeoff
> > > >
> > > > Yes, exactly, we can and we will do that.
> > > >
> > > > > between contention and overall memory usage (but as pointed out, ringbuf doesn't use 2x more memory). Do you actually see a lot of contention when sharing a ringbuf between 2 CPUs? There are multiple
> > > >
> > > > Actually no, I haven't seen a lot of contention with this configuration; it seems to handle the throughput quite well. BTW, it's still an experimental solution, so it hasn't been tested much against real-world workloads.
> > > >
> > > > > applications that share a single ringbuf between all CPUs, and no one has really complained about high contention so far. You'd need to push tons of data non-stop, probably, at which point I'd worry about consumers not being able to keep up (and definitely not doing much useful with all this data). But YMMV, of course.
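
To make the per-CPU-pair workaround mentioned above concrete, here is a minimal BPF-side sketch (not Falco's actual code; the event layout, the 8 MB size, and all map/program names are illustrative assumptions): an outer BPF_MAP_TYPE_ARRAY_OF_MAPS holds one BPF_MAP_TYPE_RINGBUF per CPU pair, and the program picks the slot from the current CPU id before reserving space.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical fixed-size event, just for the sketch. */
    struct event {
        __u32 cpu;
        __u32 syscall_id;
    };

    /* Inner map template: one 8 MB ring buffer. */
    struct ringbuf_map {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 8 * 1024 * 1024);
    } rb_0 SEC(".maps");

    /* Outer array of maps: slot i serves CPUs 2*i and 2*i+1. User space
     * is expected to create and insert the remaining ring buffers at
     * runtime; rb_0 mainly provides the inner map spec. */
    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, 64);
        __type(key, __u32);
        __array(values, struct ringbuf_map);
    } ringbuf_maps SEC(".maps") = {
        .values = { [0] = &rb_0 },
    };

    SEC("raw_tp/sys_enter")
    int handle_sys_enter(void *ctx)
    {
        __u32 cpu = bpf_get_smp_processor_id();
        __u32 slot = cpu / 2;
        void *rb = bpf_map_lookup_elem(&ringbuf_maps, &slot);
        struct event *e;

        if (!rb)
            return 0;
        e = bpf_ringbuf_reserve(rb, sizeof(*e), 0);
        if (!e)
            return 0; /* buffer full: the event is dropped */
        e->cpu = cpu;
        e->syscall_id = 0; /* would come from the tracepoint args */
        bpf_ringbuf_submit(e, 0);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

Sharing a buffer among more CPUs (or using a single ring buffer for all of them) is just a matter of changing the divisor, which is exactly the contention vs. memory trade-off discussed in the thread.
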
> > > > We are a little bit worried about the single ring buffer scenario, mainly when we have something like 64 CPUs and all syscalls enabled, but as you correctly highlighted, in that case we would also have some issues on the userspace side, because we wouldn't be able to handle all this traffic, causing tons of event drops. BTW, thank you for the feedback!
> > >
> > > If you decide to use ringbuf, I'd leverage its ability to be used across multiple CPUs and thus reduce the OOM score concern. This is what we see in practice here at Meta: with the same or even a smaller total amount of memory used for the ringbuf(s), compared to perfbuf, we see fewer (or no) event drops thanks to the bigger shared buffer that can absorb temporary spikes in the amount of events produced.
> >
> > Thank you for the valuable feedback about shared ring buffers. We are already experimenting with similar solutions to mitigate the OOM score issue; maybe this could be the right way to go for our use case too!
>
> Hopefully this will work for you.
>
> > > > > > jirka
> > > > > >
> > > > > > > Thank you in advance for your time,
> > > > > > > Andrea
> > > > > > >
> > > > > > > [0]: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > > > > > [1]: https://github.com/falcosecurity/falco
> > > > > > > [2]: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@xxxxxx/
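
On the consumer side, a single libbpf ring_buffer instance can drain one shared buffer or several per-CPU-pair buffers from one thread. A rough sketch, assuming the hypothetical event layout from the earlier sketch and that the loader has collected the ring buffer map file descriptors into fds[]:

    #include <stdio.h>
    #include <linux/types.h>
    #include <bpf/libbpf.h>

    /* Matches the hypothetical event layout used on the BPF side. */
    struct event {
        __u32 cpu;
        __u32 syscall_id;
    };

    static int handle_event(void *ctx, void *data, size_t size)
    {
        const struct event *e = data;

        printf("cpu=%u syscall=%u\n", e->cpu, e->syscall_id);
        return 0;
    }

    /* fds[] are the file descriptors of the ring buffer maps that were
     * created and inserted into the outer array-of-maps by the loader. */
    static int consume(int *fds, int nr_fds)
    {
        struct ring_buffer *rb;
        int i, err = 0;

        rb = ring_buffer__new(fds[0], handle_event, NULL, NULL);
        if (!rb)
            return -1;
        for (i = 1; i < nr_fds; i++) {
            err = ring_buffer__add(rb, fds[i], handle_event, NULL);
            if (err)
                goto out;
        }
        /* One thread drains all registered buffers. */
        while ((err = ring_buffer__poll(rb, 100 /* ms */)) >= 0)
            ;
    out:
        ring_buffer__free(rb);
        return err;
    }

ring_buffer__poll() multiplexes all registered maps over a single epoll instance, so adding more shared buffers does not require extra consumer threads.
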