On 2023/12/28 02:02, Alexei Starovoitov wrote:
On Wed, Dec 27, 2023 at 2:01 AM Philo Lu <lulie@xxxxxxxxxxxxxxxxx> wrote:
The patch set introduces a new type of map, BPF_MAP_TYPE_RELAY, based on
the relay interface [0]. It provides a way for persistent and overwritable
data transfer.
As stated in [0], relay is an efficient method for log and data transfer.
And the interface is simple enough that we can implement and use this
type of map with the current map interfaces. Besides, we need a kfunc
bpf_relay_output to output data to user space, similar to bpf_ringbuf_output.
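
To make the proposed API concrete, a minimal BPF-side sketch is below. The
map attributes (sub-buffer size/count) and the exact bpf_relay_output
signature are assumptions modeled on bpf_ringbuf_output, not necessarily
what this patch set defines:

/* Sketch only: the map attributes and the kfunc signature below are
 * assumptions mirroring bpf_ringbuf_output(); see the patches for the
 * real definitions.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RELAY);
	__uint(max_entries, 8);		/* assumed: number of sub-buffers */
	__uint(value_size, 4096);	/* assumed: size of each sub-buffer */
} my_relay SEC(".maps");

/* assumed kfunc declaration, mirroring bpf_ringbuf_output() */
extern int bpf_relay_output(struct bpf_map *map, void *data,
			    __u64 data__sz, __u32 flags) __ksym;

struct event {
	__u64 ts;
	__u32 pid;
};

SEC("tracepoint/sched/sched_switch")
int trace_switch(void *ctx)
{
	struct event e = {
		.ts  = bpf_ktime_get_ns(),
		.pid = bpf_get_current_pid_tgid() >> 32,
	};

	/* keep writing; no consumer has to be attached (overwrite mode) */
	bpf_relay_output((struct bpf_map *)&my_relay, &e, sizeof(e), 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
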
We need this map because currently neither ringbuf nor perfbuf satisfies
the requirements of relatively long-term, continuous tracing, where the bpf
program keeps writing into the buffer without any bundled reader and the
buffer supports overwriting. Users just run the bpf program to collect
data and read it whenever they need to. The detailed discussion can be
found at [1].
Hold on.
Earlier I mistakenly assumed that this relayfs is a multi producer
buffer instead of per-cpu.
Since it's actually per-cpu I see no need to introduce another per-cpu
ring buffer. We already have a perf_event buffer.
I think the relay map and perfbuf don't conflict with each other, and the
relay map could be a better choice in some use cases (e.g., continuous
tracing). In our application, we output the tracing records as strings into
relay files, and users just read them through `cat` without any extra
consumer process, which seems impossible to implement even with a pinnable
perfbuf.
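
For the string-record case, the BPF side could look roughly like this
(reusing the hypothetical my_relay map and bpf_relay_output declaration
from the sketch above; bpf_snprintf is an existing helper):

/* Sketch: emit human-readable lines so userspace can simply cat the
 * relay file; the relay map/kfunc pieces are the same assumptions as
 * in the earlier sketch.
 */
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_exec(void *ctx)
{
	char line[64];
	__u64 args[1] = { bpf_get_current_pid_tgid() >> 32 };
	long len;

	/* bpf_snprintf() returns the formatted length including the
	 * trailing NUL; drop the NUL when emitting the record
	 */
	len = bpf_snprintf(line, sizeof(line), "exec pid=%llu\n",
			   args, sizeof(args));
	if (len > 1 && len <= sizeof(line))
		bpf_relay_output((struct bpf_map *)&my_relay, line,
				 len - 1, 0);
	return 0;
}
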
Specifically, the advantages of the relay map are summarized as follows:
(1) Read at any time without an extra consumer process: As discussed
before, with the relay map, bpf programs can keep writing into the buffer
and users can read at any time.
(2) Custom data format: Unlike perfbuf, which handles data entry by entry
(i.e., per event), the data format of relay is entirely up to users. It
could be a simple string, or a binary struct with a header, which gives
users high flexibility.
(3) Better performance: Due to its simple design, relay outperforms
perfbuf in the current bench_ringbufs (I added a relay map case to
`tools/testing/selftests/bpf/benchs/bench_ringbufs.c` without other
changes). Note that relay outputs data directly without notification,
and the consumer can get a batch of samples with a single read(); a rough
userspace sketch of such a consumer follows the benchmark numbers below.
Single-producer, parallel producer, sampled notification
========================================================
relaymap            51.652 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf           22.773 ± 0.015M/s (drops 0.000 ± 0.000M/s)
rb-custom           23.782 ± 0.004M/s (drops 0.000 ± 0.000M/s)
pb-libbpf           18.506 ± 0.007M/s (drops 0.000 ± 0.000M/s)
pb-custom           19.503 ± 0.007M/s (drops 0.000 ± 0.000M/s)
Single-producer, back-to-back mode
==================================
relaymap            44.771 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-libbpf           25.091 ± 0.013M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled   24.779 ± 0.018M/s (drops 0.000 ± 0.000M/s)
rb-custom           27.784 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-custom-sampled   27.414 ± 0.017M/s (drops 0.000 ± 0.000M/s)
pb-libbpf            1.409 ± 0.000M/s (drops 0.000 ± 0.000M/s)
pb-libbpf-sampled   18.467 ± 0.005M/s (drops 0.000 ± 0.000M/s)
pb-custom            1.415 ± 0.000M/s (drops 0.000 ± 0.000M/s)
pb-custom-sampled   19.913 ± 0.007M/s (drops 0.000 ± 0.000M/s)
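
As a rough illustration of that consumer side, here is a userspace sketch
that read()s a batch of records with a small custom header from one per-cpu
relay file. The debugfs path and the record layout are assumptions for
illustration, not something this patch set defines:

/* Sketch of a userspace consumer: one read() returns a batch of
 * records; the file path and the per-record header are assumptions.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct rec_hdr {		/* assumed custom per-record header */
	uint32_t len;		/* payload length in bytes */
	uint32_t type;
};

int main(void)
{
	/* assumed path: one relay file per CPU created for the map */
	int fd = open("/sys/kernel/debug/my_relay/cpu0", O_RDONLY);
	char buf[64 * 1024];
	ssize_t n;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* no per-event notification: each read() drains a batch */
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		size_t off = 0;

		while (off + sizeof(struct rec_hdr) <= (size_t)n) {
			struct rec_hdr hdr;

			memcpy(&hdr, buf + off, sizeof(hdr));
			off += sizeof(hdr);
			if (off + hdr.len > (size_t)n)
				break;	/* partial record at the end */
			printf("record type=%u len=%u\n", hdr.type, hdr.len);
			off += hdr.len;
		}
	}
	close(fd);
	return 0;
}
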
Thanks.
Earlier you said:
"I can use BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
not know how to get the buffer again in a new process."
Looks like the issue is the lack of a map_fd_sys_lookup_elem callback?
Let's solve that latter part.
A perf_event_array map should be pinnable like any other map,
so there is a way to get an FD to the map in a new process.
What's missing is a way to get an FD to the perf event itself.
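
For concreteness, a minimal userspace sketch of that gap, using libbpf's
low-level wrappers (the pin path and key layout are assumptions):

/* Sketch: re-acquiring the pinned perf_event_array in a new process
 * already works; what is missing is a way to turn an element back
 * into a perf event FD.
 */
#include <bpf/bpf.h>
#include <stdio.h>

int main(void)
{
	/* works today, assuming the first process pinned the map with
	 * bpf_obj_pin(map_fd, "/sys/fs/bpf/my_perf_array")
	 */
	int map_fd = bpf_obj_get("/sys/fs/bpf/my_perf_array");
	__u32 cpu = 0;
	__u32 value;

	if (map_fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}

	/* the missing piece: perf_event_array has no
	 * map_fd_sys_lookup_elem, so this cannot hand back a perf event
	 * FD to mmap() the buffer from
	 */
	if (bpf_map_lookup_elem(map_fd, &cpu, &value) < 0)
		perror("bpf_map_lookup_elem");

	return 0;
}
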