On Mon, May 25, 2020 at 9:01 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote: > > On Fri, May 22, 2020 at 11:46:49AM -0700, Andrii Nakryiko wrote: > > On Thu, May 21, 2020 at 5:25 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote: > > > > > > On Sun, May 17, 2020 at 12:57:21PM -0700, Andrii Nakryiko wrote: > > > > This commits adds a new MPSC ring buffer implementation into BPF ecosystem, > > > > which allows multiple CPUs to submit data to a single shared ring buffer. On > > > > the consumption side, only single consumer is assumed. > > > > > > [ . . . ] > > > > > > Focusing just on the ring-buffer mechanism, with a question or two > > > below. Looks pretty close, actually! > > > > > > Thanx, Paul > > > > > > > Thanks for review, Paul! > > > > > > Signed-off-by: Andrii Nakryiko <andriin@xxxxxx> > > > > --- [...] > > > > + > > > > +static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) > > > > +{ > > > > + unsigned long cons_pos, prod_pos; > > > > + > > > > + cons_pos = smp_load_acquire(&rb->consumer_pos); > > > > > > What happens if there is a delay here? (The delay might be due to > > > interrupts, preemption in PREEMPT=y kernels, vCPU preemption, ...) > > > > > > If this is called from a producer holding the lock, then the only > > > ->consumer_pos can change, and that can only decrease the amount of > > > data available. Besides which, ->consumer_pos is sampled first. > > > But why would a producer care how much data was queued, as opposed > > > to how much free space was available? > > > > see second point below, it's for BPF program to implement custom > > notification policy/heuristics > > > > > > > > From the consumer, only ->producer_pos can change, and that can only > > > increase the amount of data available. (Assuming that producers cannot > > > erase old data on wrap-around before the consumer consumes it.) > > > > right, they can't overwrite data > > > > > > > > So probably nothing bad happens. > > > > > > On the bit about the producer holding the lock, some lockdep assertions > > > might make things easier on your future self. > > > > So this function is currently called in two situations, both not > > holding the spinlock. I'm fully aware this is a racy way to do this, > > but it's ok for the applications. > > > > So the first place is during initial poll "subscription", when we need > > to figure out if there is already some unconsumed data in ringbuffer > > to let poll/epoll subsystem know whether caller can do read without > > waiting. If this is done from consumer thread, it's race free, because > > consumer position won't change, so either we'll see that producer > > > consumer, or if we race with producer, producer will see that consumer > > pos == to-be-committed record pos, so will send notification. If epoll > > happens from non-consumer thread, it's similarly racy to perf buffer > > poll subscription implementation, but with next record subscriber will > > get notification anyways. > > > > The second place is from BPF side (so potential producer), but it's > > outside of producer lock. It's called from bpf_ringbuf_query() helper > > which is supposed to return various potentially stale properties of > > ringbuf (e.g., consumer/producer position, amount of unconsumed data, > > etc). The use case here is for BPF program to implement its own > > flexible poll notification strategy (or whatever else creative > > programmer will come up with, of course). E.g., batched notification, > > which happens not on every record, but only once enough data is > > accumulated. In this case exact amount of unconsumed data is not > > critical, it's only a heuristics. And it's more convenient than amount > > of data left free, IMO, but both can be calculated using the other, > > provided the total size of ring buffer is known (which can be returned > > by bpf_ringbuf_query() helper as well). > > > > So I think in both cases I don't want to take spinlock. If this is not > > done under spinlock, then I have to read consumer pos first, producer > > pos second, otherwise I can get into a situation of having > > consumer_pos > producer_pos (due to delays), which is really bad. > > > > It's a bit wordy explanation, but I hope this makes sense. > > Fair enough, and the smp_load_acquire() calls do telegraph the lockless > access. But what if timing results in the difference below coming out > negative, that is, a very large unsigned long value? This can happen only if, in the meantime, consumer_pos gets updated at least once, while producer_pos advances by at least 2^32 or 2^64, depending on architecture. This is essentially impossible on 64-bit arches, and extremely unlikely on 32-bit ones. So I'd say it is ok, especially given that this is for heuristics? > > > > > + prod_pos = smp_load_acquire(&rb->producer_pos); > > > > + return prod_pos - cons_pos; > > > > +} > > > > + > > > > +static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp, > > > > + struct poll_table_struct *pts) > > > > +{ > > > > + struct bpf_ringbuf_map *rb_map; > > > > + > > > > + rb_map = container_of(map, struct bpf_ringbuf_map, map); > > > > + poll_wait(filp, &rb_map->rb->waitq, pts); > > > > + > > > > + if (ringbuf_avail_data_sz(rb_map->rb)) > > > > + return EPOLLIN | EPOLLRDNORM; > > > > + return 0; > > > > +} > > > > + > > > > +const struct bpf_map_ops ringbuf_map_ops = { > > > > + .map_alloc = ringbuf_map_alloc, > > > > + .map_free = ringbuf_map_free, > > > > + .map_mmap = ringbuf_map_mmap, > > > > + .map_poll = ringbuf_map_poll, > > > > + .map_lookup_elem = ringbuf_map_lookup_elem, > > > > + .map_update_elem = ringbuf_map_update_elem, > > > > + .map_delete_elem = ringbuf_map_delete_elem, > > > > + .map_get_next_key = ringbuf_map_get_next_key, > > > > +}; > > > > + > > > > +/* Given pointer to ring buffer record metadata and struct bpf_ringbuf itself, > > > > + * calculate offset from record metadata to ring buffer in pages, rounded > > > > + * down. This page offset is stored as part of record metadata and allows to > > > > + * restore struct bpf_ringbuf * from record pointer. This page offset is > > > > + * stored at offset 4 of record metadata header. > > > > + */ > > > > +static size_t bpf_ringbuf_rec_pg_off(struct bpf_ringbuf *rb, > > > > + struct bpf_ringbuf_hdr *hdr) > > > > +{ > > > > + return ((void *)hdr - (void *)rb) >> PAGE_SHIFT; > > > > +} > > > > + > > > > +/* Given pointer to ring buffer record header, restore pointer to struct > > > > + * bpf_ringbuf itself by using page offset stored at offset 4 > > > > + */ > > > > +static struct bpf_ringbuf * > > > > +bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) > > > > +{ > > > > + unsigned long addr = (unsigned long)(void *)hdr; > > > > + unsigned long off = (unsigned long)hdr->pg_off << PAGE_SHIFT; > > > > + > > > > + return (void*)((addr & PAGE_MASK) - off); > > > > +} > > > > + > > > > +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) > > > > +{ > > > > + unsigned long cons_pos, prod_pos, new_prod_pos, flags; > > > > + u32 len, pg_off; > > > > + struct bpf_ringbuf_hdr *hdr; > > > > + > > > > + if (unlikely(size > RINGBUF_MAX_RECORD_SZ)) > > > > + return NULL; > > > > + > > > > + len = round_up(size + BPF_RINGBUF_HDR_SZ, 8); > > > > + cons_pos = smp_load_acquire(&rb->consumer_pos); > > > > > > There might be a longish delay acquiring the spinlock, which could mean > > > that cons_pos was out of date, which might result in an unnecessary > > > producer-side failure. Why not pick up cons_pos after the lock is > > > acquired? After all, it is in the same cache line as the lock, so this > > > should have negligible effect on lock-hold time. > > > > Right, the goal was to minimize amount of time that spinlock is hold. > > > > Spinlock and consumer_pos are certainly not on the same cache line, > > consumer_pos is deliberately on a different *page* altogether (due to > > mmap() and to avoid unnecessary sharing between consumers, who don't > > touch spinlock, and producers, who do use lock to coordinate). > > > > So I chose to potentially drop sample due to a bit stale consumer pos, > > but speed up locked section. > > OK, fair enough! > > > > (Unless you had either really small cachelines or really big locks. > > > > > > > + if (in_nmi()) { > > > > + if (!spin_trylock_irqsave(&rb->spinlock, flags)) > > > > + return NULL; > > > > + } else { > > > > + spin_lock_irqsave(&rb->spinlock, flags); > > > > + } > > > > + > > > > + prod_pos = rb->producer_pos; > > > > + new_prod_pos = prod_pos + len; > > > > + > > > > + /* check for out of ringbuf space by ensuring producer position > > > > + * doesn't advance more than (ringbuf_size - 1) ahead > > > > + */ > > > > + if (new_prod_pos - cons_pos > rb->mask) { > > > > + spin_unlock_irqrestore(&rb->spinlock, flags); > > > > + return NULL; > > > > + } > > > > + > > > > + hdr = (void *)rb->data + (prod_pos & rb->mask); > > > > + pg_off = bpf_ringbuf_rec_pg_off(rb, hdr); > > > > + hdr->len = size | BPF_RINGBUF_BUSY_BIT; > > > > + hdr->pg_off = pg_off; > > > > + > > > > + /* ensure header is written before updating producer positions */ > > > > + smp_wmb(); > > > > > > The smp_store_release() makes this unnecessary with respect to > > > ->producer_pos. So what later write is it also ordering against? > > > If none, this smp_wmb() can go away. > > > > > > And if the later write is the xchg() in bpf_ringbuf_commit(), the > > > xchg() implies full barriers before and after, so that the smp_wmb() > > > could still go away. > > > > > > So other than the smp_store_release() and the xchg(), what later write > > > is the smp_wmb() ordering against? > > > > I *think* my intent here was to make sure that when consumer reads > > producer_pos, it sees hdr->len with busy bit set. But I also think > > that you are right and smp_store_release (producer_pos) ensures that > > if consumer does smp_read_acquire(producer_pos) it will see up-to-date > > hdr->len, for a given value of producer_pos. So yeah, I think I should > > drop smp_wmb(). I dropped it in litmus tests, and they still pass, > > which gives me a bit more confidence as well. :) > > > > I say "I *think*", because this algorithm started off completely > > locklessly without producer spinlock, so maybe I needed this barrier > > for something there, but I honestly don't remember details of lockless > > implementation by now. > > > > So in summary, yeah, I'll drop smp_wmb(). > > Sounds good! > > Thanx, Paul > > > > > + /* pairs with consumer's smp_load_acquire() */ > > > > + smp_store_release(&rb->producer_pos, new_prod_pos); > > > > + > > > > + spin_unlock_irqrestore(&rb->spinlock, flags); > > > > + > > > > + return (void *)hdr + BPF_RINGBUF_HDR_SZ; > > > > +} > > > > + > > > > +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags) > > > > +{ > > > > + struct bpf_ringbuf_map *rb_map; > > > > + > > > > + if (unlikely(flags)) > > > > + return 0; > > > > + > > > > + rb_map = container_of(map, struct bpf_ringbuf_map, map); > > > > + return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size); > > > > +} > > > > + > > > > +const struct bpf_func_proto bpf_ringbuf_reserve_proto = { > > > > + .func = bpf_ringbuf_reserve, > > > > + .ret_type = RET_PTR_TO_ALLOC_MEM_OR_NULL, > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > + .arg2_type = ARG_CONST_ALLOC_SIZE_OR_ZERO, > > > > + .arg3_type = ARG_ANYTHING, > > > > +}; > > > > + > > > > +static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) > > > > +{ > > > > + unsigned long rec_pos, cons_pos; > > > > + struct bpf_ringbuf_hdr *hdr; > > > > + struct bpf_ringbuf *rb; > > > > + u32 new_len; > > > > + > > > > + hdr = sample - BPF_RINGBUF_HDR_SZ; > > > > + rb = bpf_ringbuf_restore_from_rec(hdr); > > > > + new_len = hdr->len ^ BPF_RINGBUF_BUSY_BIT; > > > > + if (discard) > > > > + new_len |= BPF_RINGBUF_DISCARD_BIT; > > > > + > > > > + /* update record header with correct final size prefix */ > > > > + xchg(&hdr->len, new_len); > > > > + > > > > + /* if consumer caught up and is waiting for our record, notify about > > > > + * new data availability > > > > + */ > > > > + rec_pos = (void *)hdr - (void *)rb->data; > > > > + cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask; > > > > + > > > > + if (flags & BPF_RB_FORCE_WAKEUP) > > > > + irq_work_queue(&rb->work); > > > > + else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP)) > > > > + irq_work_queue(&rb->work); > > > > +}