Re: [Resend] BPF ringbuf misses notifications due to improper coherence

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 13 Jun 2022 13:26:37 -0700

On Mon, Jun 13, 2022 at 11:42 AM Tatsuyuki Ishi <ishitatsuyuki@xxxxxxxxx> wrote:
>
> The BPF ringbuf defaults to a mechanism to deliver epoll notifications
> only when the userspace seems to "catch up" to the last written entry.
> This is done by comparing the consumer pointer to the head of the last
> written entry, and if it's equal, a notification is sent.
>
> During the effort of implementing ringbuf in aya [1] I observed that
> the epoll loop will sometimes get stuck, entering the wait state but
> never getting the notification it's supposed to get. The
> implementation originally mirrored libbpf's logic, especially its use
> of acquire and release memory operations. However, it turned out that
> the use of causal memory model is not sufficient, and using a seq_cst
> store is required to avoid anomalies as outlined below.
>
> The release-acquire ordering permits the following anomaly to happen
> (in a simplified model where writing a new entry atomically completes
> without going through busy bit):
>
> kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
> user  : write c 2 -> read p 2
>
> This is because the release-acquire model allows stale reads, and in
> the case above the stale reads means that none of the causal effect
> can prevent this anomaly from happening. In order to prevent this
> anomaly, a total ordering needs to be enforced on producer and
> consumer writes. (Interestingly, it doesn't need to be enforced on
> reads, however.)
>
> If this is correct, then the fix needed right now is to correct
> libbpf's stores to be sequentially consistent. On the kernel side,
> however, we have something weird, probably inoptimal, but still
> correct. The kernel uses xchg when clearing the BUSY flag [2]. This
> doesn't sound like a necessary thing, since making the written event
> visible only require release ordering. However, it's this xchg that
> provides the other half of total ordering in order to prevent the
> anomalies, as it performs a smp_mb, essentially upgrading the prior
> store to seq_cst. If the intention was actually that, it would be
> really obscure and hard-to-reason way to implement coherency. I'd
> appreciate a clarification on this.
>
> [1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
> [2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384

Hey Tatsuyuki,

Sorry for not getting back in time, I haven't missed or forgot about
this, it's just in my TODO queue and I haven't had time to seriously
look at this. No one has reported this for libbpf previously, but it
could be because most people specify timeout on ring_buffer__poll() so
never noticed this issue.

I need a bit more time to page in all the context and recall semantics
around smp_load_acquire/smp_store_release and stuff like that. As for
xchg, if I remember correctly it was a deliberate choice, I remember
Paul suggesting that xchg is faster for what we needed with ringbuf.

I'd like to get to this in next few days, but meanwhile have you tried
to benchmark what are the implications of stricter memory ordering
changes on libbpf side? Do you have example changes you were thinking
about for libbpf side? I can try benchmarking it on my side as well.

Thanks!