On Thu, May 28, 2020 at 3:17 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, May 28, 2020 at 06:00:47PM -0400, Joel Fernandes wrote:
>
> > Any idea why this choice of locking-based ring buffer implementation
> > in BPF? The ftrace ring buffer can support NMI interruptions as well
> > for writes.
> >
> > Also, is it possible for BPF to reuse the ftrace ring buffer
> > implementation or does it not meet the requirements?
>
> Both perf and ftrace are per-cpu, which, according to the patch
> description is too much memory overhead for them. Neither have ever
> considered anything else, atomic ops are expensive.

Right. As mentioned in the commit description in [0], the desire to use a
shared ring buffer across multiple CPUs, to save memory and absorb bigger
spikes with lower overall memory use, was one of the main motivations.
Ordered events was another one. Both the perf buffer and the ftrace buffer
are strictly per-CPU, which in practice forces a lot of developers into a
hard and non-obvious choice between using too much memory and losing too
many events when a buffer runs out of space.

  [0] https://patchwork.ozlabs.org/project/netdev/patch/20200526063255.1675186-2-andriin@xxxxxx/

> On top of that, they want multi-producer support. Yes, doing that gets
> interesting really fast, but using spinlocks gets you a trainwreck like
> this.
>
> This thing so readily wanting to drop data on the floor should worry

It's true that *in NMI context*, if the spinlock is already taken, the
reservation will fail, so under high contention there can be a large
number of drops. For such cases perfbuf might be a better choice, and
those BPF applications will have to accept the higher memory usage. In
practice, though, most BPF programs do not run in NMI context, so there
won't be any drops due to spinlock usage. Having both perfbuf and this
new BPF ringbuf gives people a choice and more options to tailor to
their needs.
There is another cluster of applications which are unnecessarily more
complicated just because there is no ordering between correlated events
that happen on different CPUs. Per-CPU buffers are not well suited for
such cases, unfortunately.

> people, but apparently they've not spend enough time debugging stuff
> with partial logs yet. Of course, bpf_prog_active already makes BPF
> lossy, so maybe they went with that.

Not sure which "partial logs" you mean; I'll assume dropped samples? All
production BPF applications are already designed to handle data loss,
because it can and does happen when a buffer runs out of space. Of
course, the amount of such drops is minimized however possible, but
applications still need to be able to handle it.

Now, though, there will be more options. It's no longer just a question
of whether to allocate a tiny 64KB per-CPU buffer on an 80-core server
and use a reasonable 5MB for perfbuf overall, but suffer high and
frequent data loss whenever a burst of incoming events happens; or to
bump it up to, say, 256KB (or higher) and use 20MB+, most of which will
sit unused most of the time, but be able to absorb 4x more events. It
might now be enough to allocate a single shared 5MB buffer and absorb
much higher spikes (assuming, of course, that not all CPUs spike at the
same time, in which case nothing can really help you much w.r.t. data
loss). So many BPF users are acutely aware of data loss and care a lot,
but there are other constraints that they have to take into account.

As for the expensiveness of spinlocks and atomics, a lot of the
applications we are seeing do not require the huge throughput that
per-CPU data structures provide, so such overhead is acceptable. Even
under high contention, BPF ringbuf performance is quite reasonable and
will satisfy a lot of applications, see [1].
  [1] https://patchwork.ozlabs.org/project/netdev/patch/20200526063255.1675186-5-andriin@xxxxxx/

> All reasons why I never bother with BPF, aside from it being more
> difficult than hacking up a kernel in the first place.

It's not my goal to pitch BPF here, but for a lot of real-world use
cases, hacking the kernel is simply not an option, for many reasons. One
of them is that kernel upgrades across a huge fleet of servers take a
long time, which teams can't afford to wait for. In such cases BPF is a
perfect solution that can't be beaten, as evidenced by the wide variety
of BPF applications solving real problems.