Re: BPF ring buffer variable-length data appending

John Fastabend <john.fastabend@xxxxxxxxx> · Sat, 09 Jan 2021 16:57:55 -0800

Andrii Nakryiko wrote:
> On Thu, Jan 7, 2021 at 9:44 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
> >
> > Andrii Nakryiko wrote:
> > > We discussed this topic today at office hour. As I mentioned, I don't
> > > know the ideal solution, but here is something that has enough
> > > flexibility for real-world uses, while giving the performance and
> > > convenience of reserve/commit API. Ignore naming, we can bikeshed that
> > > later.
> >
> > Missed office hours today, dang sounds interesting. Apoligies if I'm
> > missing some context.
> >
> > >
> > > So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> > >
> > > bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> > > *extra, __u64 extra_sz);
> > >
> > > The idea is that we reserve a fixed size amount of data that can be
> > > used like it is today for filling a fixed-sized metadata/sample
> > > directly. But the real size of the reserved sample is (size +
> > > extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> > > (kernel or user, depending on flags) data from extra and put it right
> > > after the fixed-size part.
> > >
> > > So the use would be something like:
> > >
> > > struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> > > BPF_RB_PROBE_USER, env_vars, 1024);
> > >
> > > if (!m)
> > >     /* too bad, either probe_read_user failed or ringbuf is full */
> > >     return 1;
> > >
> > > m->my_field1 = 123;
> > > m->my_field2 = 321;
> > >
> >
> > Is this a way to increase the reserved space? I'm a bit fuzzy on the
> > details here. But what happens if something else has done another
> > reserve? Then it fails I guess?
> 
> No, you misunderstood. There is no way to safely increase the size of
> already reserved ringbuf record. Naming is hard.

Great without a scatter-gather list I couldn't see any way increasing
size would work ;)

> bpf_ringbuf_reserve_extra() reserve the correct size of the record
> from the very beginning. It's just that size is (size + extra_sz). The
> difference from non-extra() variant is that size is still restricted
> to be a known constant, so that verifier can verify safe direct memory
> reads and writes within first *size* bytes. But extra_sz can be
> unknown at the verification time (so it's ARG_ANYTHING), do it can be
> calculated dynamically. That's most of the difference. The other part
> is that because BPF program can't do much with this dynamic part of
> ringbuf record, we also immediately copy provided memory for later
> submission to the user space.

OK. Thanks for the extra details.

> 
> >
> > So consider,
> >
> > CPU0                   CPU1
> >
> > bpf_ringbuf_reserve()
> >                        bpf_ringbuf_reserve()
> > bpf_ringbuf_reserve_extra()
> >
> > Does that *_reserve_extra() fail then? If so it seems very limited
> > from a use perspective. Most the systems we work with will fail more
> > often than not I think.
> >
> > If the above doesn't fail, I'm missing something about how userspace
> > can know where that buffer is without a scatter-gather list.
> >
> > >
> > > So the main problem with this is that when probe_read fails, we fail
> > > reservation completely(internally we'd just discard ringbuf sample).
> > > Is that OK? Or is it better to still reserve fixed-sized part and
> > > zero-out the variable-length part? We are combining two separate
> > > operations into a single API, so error handling is more convoluted.
> >
> > My $.02 here. Failing is going to be ugly and a real pain to deal
> 
> If data copying failed, you are not getting data you expected. So
> failing seems reasonable in such case. The convoluted and unfortunate
> part is that you don't know whether it is ringbuf ran out of free
> space or memory copying failed. If we restring extra_data pointer to
> be a known good memory (like sk_buff, map_value, etc), then this
> concern goes away. But you also won't be able to use this directly to
> read some piece of generic kernel or user memory without extra
> (explicit) bpf_probe_read(). Which might be acceptable, I think.

Seems reasonable. Not knowing what failed seems like a horrible
to debug error case so better to fix that.

> 
> > with. I think best approach is reserve a fixed-sized buffer that
> > meets your 99% case or whatever. Then reserve some overflow buffers
> > you can point to for the oddball java application with a million
> > strings. Yes you need to get more complicated in userspace to manage
> > the thing, but once that codes written everything works out.
> >
> > Also I think we keep the kernel simpler if the BPF program just
> > does another reserve() if it needs more space so,
> >
> >  bpf_ringbuf_reserve()
> >  copy
> >  copy
> >  ENOMEM <- buff is full,
> >  bpf_ringbuf_reserve()
> >  copy
> >  copy
> >  ....
> >
> > Again userspace needs some logic to join the two buffers but we
> > could come up with some user side convention to do this. libbpf
> > for example could have a small buffer header to do this if folks
> > wanted.  Smart BPF programs can even reserve a couple buffers
> > up front for the worse case and recycle them back into its
> > next invocation, I think.
> 
> First, I don't think libbpf should do anything extra here. It's just
> bound to be suboptimal and cumbersome.

Agree, not a great idea. Users will want to do this using specifics
of their use case.

> 
> But yes, this is another approach, though arguably quite complicated.
> Chaining buffers can be done simply by using cpu_id and recording the
> total number of expected chunks in the very first chunk. You can
> submit the first chunk last, if you use reserve/commit API.
> Reconstruction in user-space is going to require memory allocations
> and copying, though.

Sure, although maybe not copying if you have a scatter gather list
somewhere. I guess it all depends on your use case.

[...]

> > >
> > > But offloading that preparation to a BPF program bypasses all these
> > > error handling and memory layout questions. It will be up to a BPF
> > > program itself. From a kernel perspective, we just append a block of
> > > memory with known (at runtime) size.
> >
> > Still missing how this would be different from multiple reserve()
> > calls. Its not too hard to join user space buffers I promise ;)
> 
> It's not trivial, but also probably less efficiently, as now you need
> to dynamically allocate memory and still do extra copy. So I can
> certainly see the appeal of being able to submit one whole record,
> instead of trying to re-construct intermingled chunks across multiple
> CPUs.
> 
> Anyways, I'm not advocating one or the other approach, both have the
> right to exist, IMO. There is no free lunch, unfortunately. Either way
> complexity and extra overhead creeps in.

Sure.

> 
> >
> > >
> > > As a more restricted version of bpf_ringbuf_reserve_extra(), instead
> > > of allowing reading arbitrary kernel or user-space memory in
> > > bpf_ringbuf_reserve_extra() we can say that it has to be known and
> > > initialized memory (like MAP_VALUE pointer), so helper knows that it
> > > can just copy data directly.
> >
> > This is a fairly common operation for us, but also just chunks of a map
> > value pointer. So would want a start/end offset bytes. Often our
> > map values have extra data that user space doesn't need or care about.
> 
> Huh? It's just a pointer, you can point inside the MAP_VALUE today,
> no? And you already have extra_sz. So this should work:
> 
> struct my_data *d = bpf_map_lookup_elem(&my_map, &my_key);
> if (!d) return 0;
> 
> bpf_ringbuf_reserve_extra(&rb, 100, 0, d + 100, 200);
> 
> And you'll be copying [100, 200) range into ringbuf. No?

Yep that works fine.

> 
> >
> > >
> > > Thoughts?
> >
> > I think I missed the point.
> 
> I think you misunderstood what bpf_ringbuf_reserve_extra() is going to
> do. And we might differ in evaluating how easy it is to handle
> chunking, both in kernel and user-space :)

Yep above clarifies the ringbuf_reserve_extra() proposal thanks. And
having done the work to handle chunking I don't think its terribly
difficult, but agree non-trivial. Once done though its actually
fairly flexible and handles use cases like array of strings and
variable data size structures (TLVs ;) well and seemingly
efficiently.

> 
> >
> > >
> > > -- Andrii