Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 13 Dec 2023 15:35:19 -0800

On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2023/12/9 06:32, Andrii Nakryiko wrote:
> > On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote:
> >>
> >> On 07/12/2023 13:15, Philo Lu wrote:
> >>> Hi all. I have a question when using perfbuf/ringbuf in bpf. I will
> >>> appreciate it if you give me any advice.
> >>>
> >>> Imagine a simple case: the bpf program outputs a log (some tcp
> >>> statistics) to the user every time a packet is received, and the
> >>> user reads the logs only when he wants to. I do not want to keep a
> >>> user process alive, waiting for output from the buffer. The user can
> >>> read the buffer as needed. BTW, the order does not matter.
> >>>
> >>> To conclude, I hope the buffer performs like relayfs: (1) no need for
> >>> user process to receive logs, and the user may read at any time (and no
> >>> wakeup would be better); (2) old data can be overwritten by new ones.
> >>>
> >>> Currently, it seems that perfbuf and ringbuf cannot satisfy both: (i)
> >>> ringbuf: only satisfies (1). However, if data arrive when the buffer is
> >>> full, the new data will be lost, until the buffer is consumed. (ii)
> >>> perfbuf: only satisfies (2). But the user cannot access the buffer after
> >>> the process that created it (including perf_event.rb via mmap) exits.
> >>> Specifically, I can use BPF_F_PRESERVE_ELEMS flag to keep the
> >>> perf_events, but I do not know how to get the buffer again in a new
> >>> process.
> >>>
> >>> In my opinion, this can be solved by either of the following: (a) add
> >>> overwrite support in ringbuf (maybe a new flag for reserve), but we have
> >>> to address synchronization between kernel and user, especially under
> >>> variable data size, because when overwriting occurs, the kernel has to
> >>> update the consumer position too; (b) implement map_fd_sys_lookup_elem for
> >>> perfbuf to expose fds to the user via the map_lookup_elem syscall, and a
> >>> mechanism is needed to preserve perf_event->rb when the process exits
> >>> (otherwise the buffer will be freed by perf_mmap_close). I am not sure
> >>> if they are feasible, and which is better. If not, perhaps we can
> >>> develop another mechanism to achieve this?
> >>>
> >>
> >> There was an RFC a while back focused on supporting BPF ringbuf
> >> over-writing [1]; at the time, Andrii noted some potential issues that
> >> might be exposed by doing multiple ringbuf reserves to overfill the
> >> buffer within the same program.
> >>
> >
> > Correct. I don't think it's possible to correctly and safely support
> > overwriting with BPF ringbuf that has variable-sized elements.
> >
> > We'll need to implement MPMC ringbuf (probably with fixed sized
> > element size) to be able to support this.
> >
>
> Thank you very much!
>
> If it is indeed difficult with ringbuf, maybe I can implement a new type
> of bpf map based on the relay interface [1]? E.g., init relay during map
> creation, write into it with a bpf helper, and then the user can access it
> through the filesystem. I think it would be a simple but useful map for
> overwritable data transfer.

I don't know much about relay, tbh. Give it a try, I guess.
Alternatively, we need a better and faster implementation of
BPF_MAP_TYPE_QUEUE, which seems like the data structure that can
support overwriting and generally be a fixed-element-size
alternative/complement to BPF ringbuf.
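
To make the "fixed element size plus overwriting" idea concrete, here is a
minimal user-space sketch in plain C. It is not kernel code and not how
BPF_MAP_TYPE_QUEUE is implemented; struct tcp_log, struct owq, OWQ_CAPACITY
and the owq_* functions are hypothetical names made up for illustration. It
is single-threaded and ignores the producer/consumer synchronization that a
real map would need. The point is only that with fixed-size slots, dropping
the oldest record is a simple index bump, which is what makes overwriting
tractable compared to variable-sized records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OWQ_CAPACITY 4 /* hypothetical capacity, chosen small for the example */

/* Hypothetical fixed-size record, standing in for "some tcp statistics". */
struct tcp_log {
	uint32_t seq;
	uint32_t rtt_us;
};

/* Overwriting FIFO with fixed-size slots (illustrative only). */
struct owq {
	struct tcp_log slots[OWQ_CAPACITY];
	size_t head;  /* index of the oldest record */
	size_t count; /* number of valid records */
};

static void owq_init(struct owq *q)
{
	assert(q != NULL);
	q->head = 0;
	q->count = 0;
}

/* Push a record; if the queue is full, drop the oldest one first.
 * Returns true if an old record was overwritten. */
static bool owq_push(struct owq *q, const struct tcp_log *rec)
{
	bool overwrote = false;

	assert(q != NULL && rec != NULL);
	if (q->count == OWQ_CAPACITY) {
		/* Fixed-size slots: discarding the oldest is one index bump. */
		q->head = (q->head + 1) % OWQ_CAPACITY;
		q->count--;
		overwrote = true;
	}
	q->slots[(q->head + q->count) % OWQ_CAPACITY] = *rec;
	q->count++;
	return overwrote;
}

/* Pop the oldest record; returns false if the queue is empty. */
static bool owq_pop(struct owq *q, struct tcp_log *out)
{
	assert(q != NULL && out != NULL);
	if (q->count == 0)
		return false;
	*out = q->slots[q->head];
	q->head = (q->head + 1) % OWQ_CAPACITY;
	q->count--;
	return true;
}
```

With a capacity of 4, pushing records with seq 1 through 6 and then popping
until empty yields seq 3, 4, 5, 6: the two oldest records were overwritten
and the reader never has to be running while data arrives.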

>
> [1]
> https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
>
> >> Alan
> >>
> >> [1]
> >> https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@xxxxxxxxxxxxxxxxxxx/