BPF ring buffer variable-length data appending

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 7 Jan 2021 11:48:41 -0800

We discussed this topic today at office hours. As I mentioned, I don't
know the ideal solution, but here is something that has enough
flexibility for real-world uses, while giving the performance and
convenience of reserve/commit API. Ignore naming, we can bikeshed that
later.

So what we can do is introduce a new bpf_ringbuf_reserve() variant:

void *bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags,
                                void *extra, __u64 extra_sz);

The idea is that we reserve a fixed-size amount of data that can be
used, like it is today, for filling in fixed-size metadata/sample
directly. But the real size of the reserved sample is (size +
extra_sz), and the bpf_ringbuf_reserve_extra() helper will
bpf_probe_read (kernel or user, depending on flags) extra_sz bytes from
extra and put them right after the fixed-size part.

So the use would be something like:

struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
                                              BPF_RB_PROBE_USER,
                                              env_vars, 1024);

if (!m) {
    /* too bad, either probe_read_user failed or ringbuf is full */
    return 1;
}

m->my_field1 = 123;
m->my_field2 = 321;

So the main problem with this is that when probe_read fails, we fail the
reservation completely (internally we'd just discard the ringbuf sample).
Is that OK? Or is it better to still reserve fixed-sized part and
zero-out the variable-length part? We are combining two separate
operations into a single API, so error handling is more convoluted.
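
To make the two failure policies concrete, here is a rough userspace
model. It is only a sketch of the semantics described above, not kernel
code: struct toy_ringbuf, toy_probe_read(), toy_reserve_extra() and the
TOY_FAIL_* names are all made up for illustration, and a NULL source
pointer stands in for a faulting probe_read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a ring buffer: a flat arena with a bump offset.
 * The real BPF ringbuf also has record headers and wrap-around. */
struct toy_ringbuf {
    uint64_t words[512];   /* 4096 bytes, 8-byte aligned */
    size_t used;           /* bytes handed out so far */
};

/* Hypothetical failure policies, named for this sketch only. */
enum toy_fail_policy {
    TOY_FAIL_DISCARD,      /* failed read => no sample at all */
    TOY_FAIL_ZERO_EXTRA,   /* failed read => keep sample, zero the tail */
};

/* Stand-in for bpf_probe_read_{kernel,user}(): 0 on success, negative
 * on failure. A NULL source models a faulting read. */
static int toy_probe_read(void *dst, size_t sz, const void *src)
{
    if (!src)
        return -1;
    memcpy(dst, src, sz);
    return 0;
}

/* Model of the proposed helper: reserve size + extra_sz bytes and copy
 * extra right after the fixed-size part. */
static void *toy_reserve_extra(struct toy_ringbuf *rb, size_t size,
                               const void *extra, size_t extra_sz,
                               enum toy_fail_policy policy)
{
    size_t total = size + extra_sz;
    uint8_t *sample;

    assert(rb->used % 8 == 0);
    if (total < size || total > sizeof(rb->words) - rb->used)
        return NULL;       /* size overflow or ring buffer full */
    sample = (uint8_t *)rb->words + rb->used;

    if (toy_probe_read(sample + size, extra_sz, extra)) {
        if (policy == TOY_FAIL_DISCARD)
            return NULL;   /* rb->used untouched: sample is dropped */
        memset(sample + size, 0, extra_sz);
    }

    /* Round up to 8 bytes so the next sample stays aligned. */
    rb->used += (total + 7) & ~(size_t)7;
    return sample;
}
```

With TOY_FAIL_DISCARD the caller cannot tell a failed read from a full
buffer, which is the ambiguity noted in the comment of the usage
example. With TOY_FAIL_ZERO_EXTRA the fixed-size part always gets
through, but the consumer needs some way to tell a zeroed tail from
real data.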

Now, the main use case requested was to be able to fetch an array of
zero-terminated strings. I honestly don't think it's possible to
implement this efficiently without two copies of string data. Mostly
because to just determine the size of the string you have to read it
one extra time. And you'd probably want to copy string data into some
controlled storage first, so that you don't end up reading it once
successfully and then failing to read it on the second try. Next, when
you have multiple strings, how do you deal with partial failures? It's
even worse in terms of error handling and error propagation than the
fixed extra size variant described above.

Ignoring all that, let's say we'd implement
bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
multiple zero-terminated strings after the fixed-size prefix. Think
about the implementation. Just to determine the total size of the
ringbuf sample, you'd need to read all strings once, and probably also
copy them locally. Then you'd reserve a ringbuf sample and copy all that
for the second time. So it's as inefficient as a BPF program
constructing a single block of memory by reading all such strings
manually into a per-CPU array and then using the above
bpf_ringbuf_reserve_extra().

But offloading that preparation to a BPF program bypasses all these
error handling and memory layout questions. It will be up to a BPF
program itself. From a kernel perspective, we just append a block of
memory with known (at runtime) size.
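
The preparation step left to the BPF program can be modeled in plain
userspace C roughly as below. This is a sketch under assumptions:
toy_read_str() and toy_pack_strs() are hypothetical names, a NULL
string stands in for a failed read, scratch stands in for a per-CPU
array value, and "stop at the first failure" is just one possible
policy the program could pick.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a bounded string read: copies at most cap - 1 characters
 * plus a terminating NUL. Returns the number of bytes written including
 * the NUL, or -1 on failure. A NULL source models a faulting read. */
static long toy_read_str(char *dst, size_t cap, const char *src)
{
    size_t i;

    if (!src || cap == 0)
        return -1;
    for (i = 0; i + 1 < cap && src[i]; i++)
        dst[i] = src[i];
    dst[i] = '\0';
    return (long)(i + 1);
}

/* Pack up to n strings back to back into scratch. Returns the number of
 * bytes used; *packed receives how many strings were stored (the last
 * one may be truncated if scratch ran out of room). */
static size_t toy_pack_strs(char *scratch, size_t cap,
                            const char *const *strs, size_t n,
                            size_t *packed)
{
    size_t off = 0, i;

    assert(scratch && strs && packed);
    for (i = 0; i < n && off < cap; i++) {
        long len = toy_read_str(scratch + off, cap - off, strs[i]);

        if (len < 0)
            break;         /* policy choice: stop at first failed read */
        off += (size_t)len;
    }
    *packed = i;
    return off;
}
```

The returned length is the runtime-known size that would be passed as
extra_sz, and the program can record the packed count in its fixed-size
metadata so the consumer knows how many strings to expect.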

As a more restricted version of bpf_ringbuf_reserve_extra(), instead
of allowing it to read arbitrary kernel or user-space memory, we can
require that extra point to known and initialized memory (like a
MAP_VALUE pointer), so the helper knows that it can just copy the data
directly.

Thoughts?

-- Andrii