On Tue, Aug 30, 2022 at 6:28 AM David Vernet <void@xxxxxxxxxxxxx> wrote:
>
> On Wed, Aug 24, 2022 at 02:22:44PM -0700, Andrii Nakryiko wrote:
> > > +/* Maximum number of user-producer ringbuffer samples that can be drained in
> > > + * a call to bpf_user_ringbuf_drain().
> > > + */
> > > +#define BPF_MAX_USER_RINGBUF_SAMPLES BIT(17)
> >
> > nit: I don't think using BIT() is appropriate here. 128 * 1024 would
> > be better, IMO. This is not inherently required to be a single bit
> > constant.
>
> No problem, updated.
>
> > > +
> > >  static inline u32 bpf_map_flags_to_cap(struct bpf_map *map)
> > >  {
> > >  	u32 access_flags = map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG);
> > > @@ -2411,6 +2417,7 @@ extern const struct bpf_func_proto bpf_loop_proto;
> > >  extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
> > >  extern const struct bpf_func_proto bpf_set_retval_proto;
> > >  extern const struct bpf_func_proto bpf_get_retval_proto;
> > > +extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
> > > [...]
> > > +
> > > +static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t size, u64 flags)
> > > +{
> > > +	u64 producer_pos, consumer_pos;
> > > +
> > > +	/* Synchronizes with smp_store_release() in user-space producer. */
> > > +	producer_pos = smp_load_acquire(&rb->producer_pos);
> > > +
> > > +	/* Using smp_load_acquire() is unnecessary here, as the busy-bit
> > > +	 * prevents another task from writing to consumer_pos after it was read
> > > +	 * by this task with smp_load_acquire() in __bpf_user_ringbuf_peek().
> > > +	 */
> > > +	consumer_pos = rb->consumer_pos;
> > > +	/* Synchronizes with smp_load_acquire() in user-space producer. */
> > > +	smp_store_release(&rb->consumer_pos, consumer_pos + size + BPF_RINGBUF_HDR_SZ);
> > > +
> > > +	/* Prevent the clearing of the busy-bit from being reordered before the
> > > +	 * storing of the updated rb->consumer_pos value.
> > > +	 */
> > > +	smp_mb__before_atomic();
> > > +	atomic_set(&rb->busy, 0);
> > > +
> > > +	if (!(flags & BPF_RB_NO_WAKEUP)) {
> > > +		/* As a heuristic, if the previously consumed sample caused the
> > > +		 * ringbuffer to no longer be full, send an event notification
> > > +		 * to any user-space producer that is epoll-waiting.
> > > +		 */
> > > +		if (producer_pos - consumer_pos == ringbuf_total_data_sz(rb))
> >
> > I'm a bit confused here. This will be true only if user-space producer
> > filled out entire ringbuf data *exactly* to the last byte with a
> > single record. Or am I misunderstanding this?
>
> I think you're misunderstanding. This will indeed only be true if the ring
> buffer was full (to the last byte as you said) before the last sample was
> consumed, but it doesn't have to have been filled with a single record.
> We're just checking that producer_pos - consumer_pos is the total size of
> the ring buffer, but there can be many samples between consumer_pos and
> producer_pos for that to be the case.

You are right, never mind about the single-sample part, but I don't
think that's the important part (just something that surprised me and
made everything seem even less realistic).

>
> > If my understanding is correct, how is this a realistic use case and
> > how does this heuristic help at all?
>
> Though I think you may have misunderstood the heuristic, some more
> explanation is probably warranted nonetheless. This heuristic being useful
> relies on two assumptions:
>
> 1. It will be common for user-space to publish statically sized samples.
>
> I think this one is pretty unambiguously true, especially considering that
> BPF_MAP_TYPE_RINGBUF was put to great use with statically sized samples for
> quite some time. I'm open to hearing why that might not be the case.

True, the majority of use cases for BPF ringbuf were fixed-sized, thanks
to the convenience of the reserve/commit API.
But the data structure itself allows variable-sized samples and there
are use cases doing this, plus with dynptr it's now easier to do
variable-sized efficiently. So special-casing for fixed-sized samples is
a bit off, especially considering #2.

>
> 2. The size of the ring buffer is a multiple of the size of a sample.
>
> This one I think is a bit less clear. Users can always size the ring buffer
> to make sure this will be the case, but whether or not that will be
> commonly done is another story.

So I'm almost certain this won't be the case. I don't think anyone is
going to be tracking the exact size of the sample's struct (and it will
most probably change with time), and sizing the ringbuf to be both a
power-of-2 multiple of page_size *and* a multiple of sizeof(struct
my_ringbuf_sample) is something I don't see anyone doing.

>
> I'm fine with removing this heuristic for now if it's unclear that it's
> serving a common use-case. We can always add it back in later if we want
> to.

Yes, this looks quite out of place, resting on a bunch of optimistic but
unrealistic assumptions. Doing one notification after drain will be
fine for now, IMO.

>
> > > +			irq_work_queue(&rb->work);
> > > +
> > > +	}
> > > +}
> > > +
> > > +BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
> > > +	   void *, callback_fn, void *, callback_ctx, u64, flags)
> > > +{
> > > +	struct bpf_ringbuf *rb;
> > > +	long num_samples = 0, ret = 0;
> > > +	bpf_callback_t callback = (bpf_callback_t)callback_fn;
> > > +	u64 wakeup_flags = BPF_RB_NO_WAKEUP;
> > > +
> > > +	if (unlikely(flags & ~wakeup_flags))
> > [...]
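
To make the sizing concern about assumption #2 concrete, here is a minimal user-space sketch, not kernel code. It assumes an 8-byte record header (standing in for BPF_RINGBUF_HDR_SZ) and that each record occupies exactly size + header bytes, as in the quoted smp_store_release() line. The helper names ring_can_be_exactly_full() and was_full_before_release() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 8-byte per-record header, standing in for BPF_RINGBUF_HDR_SZ. */
#define HDR_SZ 8u

/* Hypothetical helper: with fixed-size samples, the bytes outstanding between
 * consumer_pos and producer_pos are always a multiple of the record size, so
 * the ring can only ever be *exactly* full if the record size divides it.
 */
static bool ring_can_be_exactly_full(uint64_t ring_sz, uint64_t sample_sz)
{
	uint64_t rec_sz = sample_sz + HDR_SZ;

	/* The ring data area is assumed to be a non-zero power of two. */
	assert(ring_sz != 0 && (ring_sz & (ring_sz - 1)) == 0);

	return ring_sz % rec_sz == 0;
}

/* Hypothetical helper mirroring the quoted check: consumer_pos is the value
 * read *before* the just-consumed sample was released.
 */
static bool was_full_before_release(uint64_t producer_pos,
				    uint64_t consumer_pos, uint64_t ring_sz)
{
	return producer_pos - consumer_pos == ring_sz;
}
```

Under these assumptions, a 4096-byte ring can be exactly full with 24-byte or 56-byte samples (32-byte and 64-byte records), but not with 32-byte or 40-byte samples, because the header counts against the ring as well. For the wakeup to ever fire, sample size plus header would have to be a power of two.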