On Fri, May 29, 2020 at 5:36 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > On Thu, May 28, 2020 at 10:14:21PM -0700, Andrii Nakryiko wrote: > > > There is another cluster of applications which are unnecessarily more > > complicated just for the fact that there is no ordering between > > correlated events that happen on different CPUs. Per-CPU buffers are > > not well suited for such cases, unfortunately. > > And yet, I've been debugging concurrency issues with ftrace for well > over a decade. > > In fact, adding a spinlock in the mix will make a certain class of > concurrency problems go-away! because you introduce artificial > serialization between CPUs. > > Heck, even the ftrace buffer can sometimes make problems go way just > because of the overhead of tracing itself changes the timing. > All true, but seems like you are looking at this from purely tracing perspective, especially with very high-frequency events. And that's a very challenging domain for sure. But *a lot* of BPF applications (actually all I'm aware of at Facebook and outside of it) trace and collect much less high-frequency events, so they don't need to sacrifice so much to get ultimate performance. Performance profiling, which comes closest to what you are describing, is just one of many uses. > It is perfectly possible to reconstruct order with per-cpu buffers in so > far as that there is order to begin with. Esp on modern CPUs that have > synchronized TSC. > > Computers are not SC, pretending that events are is just a lie. Right, with precise enough clock or some atomic in-kernel counter, you can re-construct order in user-space. But that has its own downsides: you need to send this counter over the wire with every sample (takes more space, reducing amounf of "useful" payload you can fit in a ring buffer), plus user-space re-sorting takes engineering effort and isn't exactly trivial. Now multiply that by many teams who need this, and it becomes a problem worth solving. Few examples of events that do occur sequentially in an ordered manner, but due to per-cpu buffers might come back to user-space out of order: fork/exec/exit events and tcp connection state tracking. E.g., for short-lived process, fork can happen on one CPU, exit - on another within tiny period of time between each other. In such case, user-space might get exit event from one buffer before getting a fork event on another one. Similarly for TCP connection state transitions. For fork/exec/exit, typical rate of events, even on 80-core machines is on the order of thousands of events per second at most (typically, spikes do happen, unfortunately), so MPSC queue contention isn't a big deal. > > > > people, but apparently they've not spend enough time debugging stuff > > > with partial logs yet. Of course, bpf_prog_active already makes BPF > > > lossy, so maybe they went with that. > > > > Not sure which "partial logs" you mean, I'll assume dropped samples? > > Yep. Trying to reconstruct what actually happened from incomplete logs > is one of the most frustrating things in the world. > > > All production BPF applications are already designed to handle data > > loss, because it could and does happen due to running out of buffer > > space. Of course, amount of such drops is minimized however possible, > > but applications still need to be able to handle that. > > Running out of space is fixable and 'obvious'. Missing random events in > the middle is bloody infuriating. Even more so if you can't tell there's > gaps in the midle. > > AFAICT, you don't even mark a reservation fail.... ah, you leave that to > the user :-( And it can't tell if it's spurious or space related. BPF program cannot ignore reservation failure, it has to check that reserve() returned non-NULL pointer, verifier enforces that. It's true that out-of-space vs locking failed in NMI is indistinguishable. We'll see how important it is to distinguish in practice, there are very few applications that do run in NMI context. As for internal accounting of dropped samples, I've considered that and there is 4 bytes per-sample I can use to communicate it back to user-space, but it requires another atomic increment, which would just reduce performance further. Modern BPF applications actually have a very simple and straightforward ways to count that themselves with use of global variables, memory-mapped into user-space. So it's simple and fast to do. But again, we'll see in practice how that works for users and will adjust/enhance as necessary. > > Same with bpf_prog_active, it's silent 'random' data loss. You can > easily tell where a CPU buffer starts and stops, and thus if the events > are contained within, but not if there's random bits missing from the > middle. > > > Now, though, there will be more options. Now it's not just a question > > of whether to allocate a tiny 64KB per-CPU buffer on 80 core server > > and use reasonable 5MB for perfbuf overall, but suffer high and > > frequent data loss whenever a burst of incoming events happen. Or bump > > it up to, say, 256KB (or higher) and use 20MB+ now, which most of the > > time will be completely unused, but be able to absorb 4x more events. > > Now it might be more than enough to just allocate a single shared 5MB > > buffer and be able to absorb much higher spikes (of course, assuming > > not all CPUs will be spiking at the same time, in which case nothing > > can really help you much w.r.t. data loss). > > Muwhahaha, a single shared buffer with 80 CPUs! That's bloody murder on > performance. Loved the laugh :) But see above, a lot of interesting events are pretty low-frequency even on those giant servers. > > > So many BPF users are acutely aware of data loss and care a lot, but > > there are other constraints that they have to take into account. > > > > As for expensiveness of spinlock and atomics, a lot of applications we > > are seeing do not require huge throughput that per-CPU data structures > > provide, so such overhead is acceptable. Even under high contention, > > BPF ringbuf performance is pretty reasonable and will satisfy a lot of > > applications, see [1]. > > I've done benchmarks on contended atomic ops, and they go from ~20 > cycles (uncontended, cache hot) to well over 10k cycles when you jump on > them with say a dozen cores across a few nodes (more numa, more > horrible). > > > [1] https://patchwork.ozlabs.org/project/netdev/patch/20200526063255.1675186-5-andriin@xxxxxx/ > > From that: "Ringbuf, multi-producer contention", right? I read that as: > 'performance is bloody horrible if you add contention'. > > I suppose most of your users have very low event rates, otherwise I > can't see that working. Yes and yes :) Good thing is that with MPSC ringbuf, if contention is an issue, they can go with a model of 1 ringbuf for each cpu, similar to perfbuf, or something in between with few ringbufs for larger number of CPUs, again trading performance for memory, as necessary. This is easy to do in BPF by combining ringbuf map with ARRAY_OF_MAPS or HASH_OF_MAPS. > > > > All reasons why I never bother with BPF, aside from it being more > > > difficult than hacking up a kernel in the first place. > > > > It's not my goal to pitch BPF here, but for a lot of real-world use > > cases, hacking kernel is not an option at all, for many reasons. One > > of them is that kernel upgrades across huge fleet of servers take a > > long time, which teams can't afford to wait. In such cases, BPF is a > > perfect solution, which can't be beaten, as evidenced by a wide > > variety of BPF applications solving real problems. > > Yeah; lots of people use it because they really have nothing better for > their situation. > > As long as people understand the constraints (and that's a *BIG* if) I > suppose it's usable. > > It's just things I don't want to have to worry about. > > Anyway, all that said, I like how you did the commits, I should look to > see if I can retro-fit the perf buffer to have some of that. Once Thanks, and yeah, it would be actually great to have this reserve/commit API for perfbuf as well. Internally it actually does that, AFAIU, but all of that is enclosed in preempt_disable/preempt_enable region, not sure how practical and easy it is to split this up and let BPF verifier enforce that each reservation is followed by commit. Having this kind of API would allow to eliminate some unfortunate code patterns with using extra per-CPU array just to prepare a record before sending it over perfbuf with perf_event_output(). > question though; why are you using xchg() for the commit? Isn't that > more expensive than it should be? > > That is, why isn't that: > > smp_store_release(&hdr->len, new_len); > > ? Or are you needing the smp_mb() for the store->load ordering for the > ->consumer_pos load? That really needs a comment. Yeah, smp_store_release() is not strong enough, this memory barrier is necessary. And yeah, I'll follow up with some more comments, that's been what Joel requested as well. > > I think you can get rid of the smp_load_acquire() there, you're ordering > a load->store and could rely on the branch to do that: > > cons_pos = READ_ONCE(&rb->consumer_pos) & rb->mask; > if ((flags & BPF_RB_FORCE_WAKEUP) || (cons_pos == rec_pos && !(flags &BPF_RB_NO_WAKEUP)) > irq_work_queue(&rq->work); > > should be a control dependency. Could be. I tried to keep consistent smp_load_acquire/smp_store_release usage to keep it simpler. It might not be the absolutely minimal amount of ordering that would still be correct. We might be able to tweak and tune this without changing correctness. Anyways, thanks for taking a look and feedback. I hope some of my examples explain why this might be a fine approach for a lot of applications, even if not the most performant one.