On Fri, Jan 27, 2023 at 7:54 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sun, Jan 15, 2023 at 9:10 AM andrea terzolo <andreaterzolo3@xxxxxxxxx> wrote:
> >
> > On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > > > Hello!
> > > > >
> > > > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > > > map. Looking at the kernel implementation [0] it seems that data pages
> > > > > are mapped 2 times to have a more efficient and simpler
> > > > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > > > buffer didn't have such an implementation. In the Falco project [1] we
> > > > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > > > system throws and the default size of each buffer is 8 MB. This means
> > > > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> > > >
> > > > hum IIUC it's not allocated twice but pages are just mapped twice,
> > > > to cope with wrap around samples, described in git log:
> > > >
> > > >   One interesting implementation bit, that significantly simplifies (and thus
> > > >   speeds up as well) implementation of both producers and consumers is how data
> > > >   area is mapped twice contiguously back-to-back in the virtual memory. This
> > > >   allows to not take any special measures for samples that have to wrap around
> > > >   at the end of the circular buffer data area, because the next page after the
> > > >   last data page would be first data page again, and thus the sample will still
> > > >   appear completely contiguous in virtual memory. See comment and a simple ASCII
> > > >   diagram showing this visually in bpf_ringbuf_area_alloc().
> > >
> > > yes, exactly, there is no duplication of memory, it's just mapped
> > > twice to make working with records that wrap around simple and
> > > efficient
> > >
> >
> > Thank you very much for the quick response, my previous question was
> > quite unclear, sorry for that, I will try to explain myself better with
> > some data. I've collected in this document [3] some thoughts regarding
> > 2 simple examples with perf buffer and ring buffer. Without going into
> > too many details about the document, I've noticed a strange value of
> > "Resident set size" (RSS) in the ring buffer example. Probably it is
> > perfectly fine, but I really don't understand why the "RSS" for each
> > ring buffer assumes the same value as the virtual memory size, and I'm
> > just asking myself if this fact could impact the OOM score computation,
> > making the program that uses ring buffers more vulnerable to the OOM
> > killer.
> >
> > [3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
> >
>
> I'm not an mm expert, unfortunately. Perhaps because we have twice as
> many pages mapped (even though they are using only 8MB of physical
> memory), it is treated as if process' RSS usage is 2x of that. I can
> see how that might be a concern for OOM score, but I'm not sure what
> can be done about this...
>

Yes, this is weird behavior. Unfortunately, a process that uses one
ring buffer per CPU is penalized from this point of view compared to
one that uses a perf buffer. Do you by any chance know someone who
could help us with this strange memory accounting?
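
For reference, this is more or less the kind of test behind the numbers
in [3] (a minimal standalone sketch, not the exact code from the
document or from Falco, and all the names are made up for the example):
create a single 8 MB BPF_MAP_TYPE_RINGBUF map, map it for consumption
with libbpf's ring_buffer__new(), and dump VmSize/VmRSS from
/proc/self/status before and after:

/* Sketch only: needs CAP_BPF (or root) and a recent libbpf. */
#include <stdio.h>
#include <string.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t size)
{
        return 0; /* discard samples, we only care about memory accounting */
}

static void print_mem_usage(const char *tag)
{
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
                return;
        printf("--- %s ---\n", tag);
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "VmSize", 6) || !strncmp(line, "VmRSS", 5))
                        fputs(line, stdout);
        fclose(f);
}

int main(void)
{
        struct ring_buffer *rb;
        int map_fd;

        print_mem_usage("before");

        /* max_entries is the data area size in bytes: 8 MB, like the
         * default Falco buffer mentioned above */
        map_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, "rb_test", 0, 0,
                                8 * 1024 * 1024, NULL);
        if (map_fd < 0) {
                perror("bpf_map_create");
                return 1;
        }

        /* ring_buffer__new() mmaps the consumer page plus the data area,
         * which the kernel exposes twice back-to-back (see
         * bpf_ringbuf_area_alloc() in [0]), so VmSize grows by roughly
         * 2x the buffer size */
        rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
        if (!rb) {
                fprintf(stderr, "ring_buffer__new failed\n");
                return 1;
        }

        print_mem_usage("after ring_buffer__new");

        ring_buffer__free(rb);
        return 0;
}
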
> > > > >
> > > > > issue is that this memory requirement could be too much for some
> > > > > systems and also in Kubernetes environments where there are strict
> > > > > resource limits... Our actual workaround is to use ring buffers shared
> > > > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > > > this solution has a price since we increase the contention on the ring
> > > > > buffers and as highlighted here [2], the presence of multiple
> > > > > competing writers on the same buffer could become a real bottleneck...
> > > > > Sorry for the long introduction, my question here is, are there any
> > > > > other approaches to manage such a scenario? Will there be a
> > > > > possibility to use the ring buffer without the kernel double mapping
> > > > > in the near future? The ring buffer has such amazing features with
> > > > > respect to the perf buffer, but in a scenario like the Falco one,
> > > > > where we have aggressive multiple producers, this double mapping could
> > > > > become a limitation.
> > > >
> > > > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > > > to have extra copy for each cpu if you don't really want to
> > > >
> > >
> > > seems like they do share, but only between CPUs. But nothing prevents
> > > you from sharing between more than 2 CPUs, right? It's a tradeoff
> >
> > Yes exactly, we can and we will do it
> >
> > > between contention and overall memory usage (but as pointed out,
> > > ringbuf doesn't use 2x more memory). Do you actually see a lot of
> > > contention when sharing ringbuf between 2 CPUs? There are multiple
> >
> > Actually no, I've not seen a lot of contention with this
> > configuration, it seems to handle the throughput quite well. BTW it's
> > still an experimental solution, so it has not been tested much against
> > real-world workloads.
> >
> > > applications that share a single ringbuf between all CPUs, and no one
> > > really complained about high contention so far. You'd need to push
> > > tons of data non-stop, probably, at which point I'd worry about
> > > consumers not being able to keep up (and definitely not doing much
> > > useful with all this data). But YMMV, of course.
> > >
> >
> > We are a little bit worried about the single ring buffer scenario,
> > mainly when we have something like 64 CPUs and all syscalls enabled,
> > but as you correctly highlighted, in this case we would also have some
> > issues on the userspace side because we wouldn't be able to handle all
> > this traffic, causing tons of event drops. BTW thank you for the feedback!
> >
>
> If you decide to use ringbuf, I'd leverage its ability to be used
> across multiple CPUs and thus reduce the OOM score concern. This is
> what we see in practice here at Meta: at the same or even smaller
> total amount of memory used for ringbuf(s), compared to perfbuf, we
> see less (or no) event drops due to bigger shared buffer that can
> absorb temporary spikes in the amount of events produced.
>

Thank you for the precious feedback about shared ring buffers, we are
already experimenting with similar solutions to mitigate the OOM score
issue, so maybe this is the right way to go for our use case as well!
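
In case it helps to make the workaround above more concrete, this is
roughly the shape of the BPF side we are experimenting with (a
stripped-down sketch with made-up names, sized for only 4 CPUs, not the
actual Falco sources): one BPF_MAP_TYPE_RINGBUF per CPU pair collected
in a BPF_MAP_TYPE_ARRAY_OF_MAPS, with the producer picking the buffer
as cpu_id / 2:

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: two 8 MB ring buffers serving four CPUs (two per buffer). */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct event {
        u32 pid;
        u32 syscall_id;
};

/* Inner map definition: one ring buffer shared by a pair of CPUs */
struct ringbuf_map {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 8 * 1024 * 1024); /* the 8 MB default mentioned above */
} rb_pair0 SEC(".maps"),
  rb_pair1 SEC(".maps");

/* Outer map: slot i holds the ring buffer used by CPUs 2*i and 2*i+1 */
struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, 2);
        __type(key, u32);
        __array(values, struct ringbuf_map);
} ringbuf_maps SEC(".maps") = {
        .values = {
                [0] = &rb_pair0,
                [1] = &rb_pair1,
        },
};

SEC("tp/raw_syscalls/sys_enter")
int handle_sys_enter(struct trace_event_raw_sys_enter *ctx)
{
        u32 key = bpf_get_smp_processor_id() / 2;
        struct event *e;
        void *rb;

        rb = bpf_map_lookup_elem(&ringbuf_maps, &key);
        if (!rb)
                return 0;

        e = bpf_ringbuf_reserve(rb, sizeof(*e), 0);
        if (!e)
                return 0; /* buffer full: this is where a drop would be counted */

        e->pid = bpf_get_current_pid_tgid() >> 32;
        e->syscall_id = (u32)ctx->id;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

On the userspace side the idea would be to register each inner map fd
on the same ring_buffer manager with ring_buffer__add(), so a single
consumer can poll all the buffers together.
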
> > > > jirka
> > > >
> > > > >
> > > > > Thank you in advance for your time,
> > > > > Andrea
> > > > >
> > > > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > > > 1: https://github.com/falcosecurity/falco
> > > > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@xxxxxx/