[Resend] BPF ringbuf misses notifications due to improper coherence

By default, the BPF ringbuf delivers an epoll notification only when
userspace appears to have caught up to the last written entry. The
kernel compares the consumer position to the head of the last written
entry and sends a notification if the two are equal.
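The default check described above can be sketched in a few lines of
C. This is only an illustration: should_wake_consumer() is a
hypothetical helper name, and position masking and the wakeup flags
that the real kernel code handles are left out.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): decide whether the
 * producer should send an epoll notification after committing a record.
 *
 * cons_pos: consumer position as read by the producer
 * rec_pos:  head (start position) of the record that was just committed
 */
static bool should_wake_consumer(unsigned long cons_pos,
                                 unsigned long rec_pos)
{
    /* Notify only if the consumer has caught up to this record. */
    return cons_pos == rec_pos;
}
```

If the consumer is still behind (cons_pos != rec_pos), no notification
is sent, on the assumption that the consumer will reach this record on
its own before it goes back to sleep.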

While implementing ringbuf support in aya [1], I observed that the
epoll loop sometimes gets stuck: it enters the wait state but never
receives the notification it is supposed to get. The implementation
originally mirrored libbpf's logic, in particular its use of acquire
and release memory operations. However, it turned out that a causal
(release-acquire) memory model is not sufficient, and a seq_cst store
is required to avoid the anomaly outlined below.

Release-acquire ordering permits the following anomaly (in a
simplified model where writing a new entry completes atomically,
without going through the busy bit; p is the producer position and c
is the consumer position):

kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
user  : write c 2 -> read p 2

This is because the release-acquire model allows stale reads. In the
case above both sides read a stale value, so no causal relationship
prevents the anomaly: the kernel sees a consumer that has not caught
up and skips the notification, while userspace sees no new entry and
goes back to waiting. To prevent this, a total order needs to be
enforced on the producer and consumer writes. (Interestingly, it does
not need to be enforced on the reads.)
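To make the lost wakeup concrete, here is a small C sketch that
replays the decisions each side makes from the values it observed in
the trace. It does not model the memory ordering itself; it only shows
that the two stale observations together lead to a missed
notification. The struct and function names are hypothetical, and
reading "write p 3" as committing the entry whose head is at position
2 is my interpretation of the simplified model.

```c
#include <stdbool.h>

/* Values each side observed in one interleaving (hypothetical type). */
struct observed {
    unsigned long kernel_rec_head;  /* head of the entry just committed  */
    unsigned long kernel_seen_cons; /* consumer position the kernel read */
    unsigned long user_cons;        /* consumer position userspace wrote */
    unsigned long user_seen_prod;   /* producer position userspace read  */
};

/* Kernel notifies only if the consumer looks caught up. */
static bool kernel_sends_wakeup(const struct observed *o)
{
    return o->kernel_seen_cons == o->kernel_rec_head;
}

/* Userspace goes back to epoll_wait if it sees nothing left to read. */
static bool user_goes_to_sleep(const struct observed *o)
{
    return o->user_seen_prod == o->user_cons;
}

/* The anomaly: userspace sleeps and the kernel never wakes it. */
static bool wakeup_lost(const struct observed *o)
{
    return user_goes_to_sleep(o) && !kernel_sends_wakeup(o);
}
```

With the values from the trace (entry head 2, kernel reads c = 1,
userspace writes c = 2 and reads p = 2), wakeup_lost() returns true.
If the kernel had instead observed c = 2, it would have sent the
notification.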

If this is correct, then the fix needed right now is to make libbpf's
stores sequentially consistent. On the kernel side, however, we have
something odd, probably suboptimal, but still correct. The kernel uses
xchg when clearing the BUSY flag [2]. This does not look necessary,
since making the written event visible only requires release ordering.
However, it is this xchg that provides the other half of the total
order needed to prevent the anomaly: it performs an smp_mb,
essentially upgrading the prior store to seq_cst. If that was the
intention, it is a really obscure and hard-to-reason-about way to
implement coherency. I'd appreciate a clarification on this.
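As a rough illustration of the two halves discussed above, here is a
C11 sketch. It is not the kernel or libbpf code: struct ring, BUSY_BIT
and both function names are hypothetical, position masking is omitted,
and atomic_exchange() merely stands in for the kernel's xchg. Whether
a seq_cst store followed by an acquire load is sufficient under the
formal C11 model, rather than on particular hardware, is not something
this sketch establishes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUSY_BIT (1u << 31) /* hypothetical stand-in for the busy flag */

struct ring {               /* hypothetical, simplified layout */
    _Atomic unsigned long consumer_pos;
    _Atomic unsigned long producer_pos;
};

/*
 * Producer side: clear the busy bit with a full-barrier exchange, then
 * read the consumer position. Returns true if a wakeup should be sent.
 */
static bool commit_and_check_wakeup(struct ring *r,
                                    _Atomic uint32_t *hdr_len,
                                    unsigned long rec_pos)
{
    uint32_t len = atomic_load_explicit(hdr_len, memory_order_relaxed);

    /* seq_cst read-modify-write, analogous to xchg with its barrier */
    atomic_exchange(hdr_len, len & ~BUSY_BIT);

    unsigned long cons =
        atomic_load_explicit(&r->consumer_pos, memory_order_acquire);
    return cons == rec_pos;
}

/*
 * Consumer side: publish the new consumer position with a seq_cst
 * store (the proposed change), then re-check the producer position.
 * Returns true if there is more data to read before sleeping.
 */
static bool consume_and_check_more(struct ring *r, unsigned long new_cons)
{
    atomic_store_explicit(&r->consumer_pos, new_cons,
                          memory_order_seq_cst);

    unsigned long prod =
        atomic_load_explicit(&r->producer_pos, memory_order_acquire);
    return prod != new_cons;
}
```

The point of the sketch is that each side performs a fully ordered
write before it reads the other side's position, so the two writes
cannot both be missed by the opposite reader in the way the trace
shows.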

[1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
[2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384


