On Mon, Jun 13, 2022 at 11:42 AM Tatsuyuki Ishi <ishitatsuyuki@xxxxxxxxx> wrote: > > The BPF ringbuf defaults to a mechanism to deliver epoll notifications > only when the userspace seems to "catch up" to the last written entry. > This is done by comparing the consumer pointer to the head of the last > written entry, and if it's equal, a notification is sent. > > During the effort of implementing ringbuf in aya [1] I observed that > the epoll loop will sometimes get stuck, entering the wait state but > never getting the notification it's supposed to get. The > implementation originally mirrored libbpf's logic, especially its use > of acquire and release memory operations. However, it turned out that > the use of causal memory model is not sufficient, and using a seq_cst > store is required to avoid anomalies as outlined below. > > The release-acquire ordering permits the following anomaly to happen > (in a simplified model where writing a new entry atomically completes > without going through busy bit): > > kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter) > user : write c 2 -> read p 2 > > This is because the release-acquire model allows stale reads, and in > the case above the stale reads means that none of the causal effect > can prevent this anomaly from happening. In order to prevent this > anomaly, a total ordering needs to be enforced on producer and > consumer writes. (Interestingly, it doesn't need to be enforced on > reads, however.) > > If this is correct, then the fix needed right now is to correct > libbpf's stores to be sequentially consistent. On the kernel side, > however, we have something weird, probably inoptimal, but still > correct. The kernel uses xchg when clearing the BUSY flag [2]. This > doesn't sound like a necessary thing, since making the written event > visible only require release ordering. However, it's this xchg that > provides the other half of total ordering in order to prevent the > anomalies, as it performs a smp_mb, essentially upgrading the prior > store to seq_cst. If the intention was actually that, it would be > really obscure and hard-to-reason way to implement coherency. I'd > appreciate a clarification on this. > > [1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687 > [2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384 Hey Tatsuyuki, Sorry for not getting back in time, I haven't missed or forgot about this, it's just in my TODO queue and I haven't had time to seriously look at this. No one has reported this for libbpf previously, but it could be because most people specify timeout on ring_buffer__poll() so never noticed this issue. I need a bit more time to page in all the context and recall semantics around smp_load_acquire/smp_store_release and stuff like that. As for xchg, if I remember correctly it was a deliberate choice, I remember Paul suggesting that xchg is faster for what we needed with ringbuf. I'd like to get to this in next few days, but meanwhile have you tried to benchmark what are the implications of stricter memory ordering changes on libbpf side? Do you have example changes you were thinking about for libbpf side? I can try benchmarking it on my side as well. Thanks!