On Thu, 7 Jan 2021, Andrii Nakryiko wrote: > We discussed this topic today at office hour. As I mentioned, I don't > know the ideal solution, but here is something that has enough > flexibility for real-world uses, while giving the performance and > convenience of reserve/commit API. Ignore naming, we can bikeshed that > later. > > So what we can do is introduce a new bpf_ringbuf_reserve() variant: > > bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void > *extra, __u64 extra_sz); > > The idea is that we reserve a fixed size amount of data that can be > used like it is today for filling a fixed-sized metadata/sample > directly. But the real size of the reserved sample is (size + > extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read > (kernel or user, depending on flags) data from extra and put it right > after the fixed-size part. > > So the use would be something like: > > struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m), > BPF_RB_PROBE_USER, env_vars, 1024); > > if (!m) > /* too bad, either probe_read_user failed or ringbuf is full */ > return 1; > > m->my_field1 = 123; > m->my_field2 = 321; > > > So the main problem with this is that when probe_read fails, we fail > reservation completely(internally we'd just discard ringbuf sample). > Is that OK? Or is it better to still reserve fixed-sized part and > zero-out the variable-length part? We are combining two separate > operations into a single API, so error handling is more convoluted. > > > Now, the main use case requested was to be able to fetch an array of > zero-terminated strings. I honestly don't think it's possible to > implement this efficiently without two copies of string data. Mostly > because to just determine the size of the string you have to read it > one extra time. And you'd probably want to copy string data into some > controlled storage first, so that you don't end up reading it once > successfully and then failing to read it on the second try. Next, when > you have multiple strings, how do you deal with partial failures? It's > even worse in terms of error handling and error propagation than the > fixed extra size variant described above. > > Ignoring all that, let's say we'd implement > bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying > multiple zero-terminated strings after the fixed-size prefix. Think > about implementation. Just to determine the total size of the ringbuf > sample, you'd need to read all strings once, and probably also copy > them locally. Then you'd reserve a ringbuf sample and copy all that > for the second time. So it's as inefficient as a BPF program > constructing a single block of memory by reading all such strings > manually into a per-CPU array and then using the above > bpf_ringbuf_reserve_extra(). > I ran into a variation of this problem with the ksnoop tool [1]. I'd hoped to use ringbuf, but the problem I had was I needed to store a series of N strings into a buffer, and for each I needed to specify a maximum size (for bpf_snprintf_btf()). However it was entirely possible that the strings would be a lot smaller, and they'd be copied one after the other, so while I needed to reserve a buffer for those N strings of (N * MAX_STRINGSIZE) size as the worst case scenario, it would likely be a lot smaller (the sum of the lengths of the N strings plus null termination), so there was no need to commit the unused space. I ended up using a BPF map-derived string buffer and perf events to send the events instead (the code for this is ksnoop.bpf.c in [1]). So all of this is to say that I'm assuming along with the reserve_extra() API, there'd need to be some sort of bpf_ringbuf_submit_extra(ringbuf, flags, extra_size) which specifies how much of the extra space was used? If that's the case I think this approach makes ringbuf usable for my scenario; the string buffer would effectively all be considered extra space, and we'd just submit what was used. However I _think_ you were suggesting above combining the probe read and the extra reservation as one operation; in my case that wouldn't work because the strings were written directly by a helper (bpf_snprintf_btf()) into the target buffer. It's probably an oddball situation of course, but thought I'd better mention it just in case. Thanks! Alan [1] https://lore.kernel.org/bpf/1609773991-10509-1-git-send-email-alan.maguire@xxxxxxxxxx/