On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory. The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> >
> > The "prior workarounds" sound entirely appropriate. Because the
> > performance and maintainability of Linux's page allocator is about
> > 1,000,040 times more important than relieving BPF of having to carry a
> > "workaround".
>
> Please explain where performance and maintainability is affected?
>

I have some related questions below. Note I'm a bystander, not claiming
to have any (N)ACK power.

A small bit before that:

	if (!spin_trylock_irqsave(&zone->lock, flags)) {
		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
			return NULL;
		spin_lock_irqsave(&zone->lock, flags);
	}

This is going to perform worse when contested due to an extra access to
the lock. I presume it was done this way to avoid suffering another
branch, with the assumption that the trylock is normally going to
succeed.

As it happens, I'm looking at parallel exec on the side, and while there
is currently bigger fish to fry, contention on zone->lock is very much a
factor. The majority of it comes from RCU freeing (in
free_pcppages_bulk()), but I also see several rmqueue calls below it. As
they trylock, they are going to make it more expensive for
free_pcppages_bulk() to even get the lock. So this *is* contested, but
at the moment it is largely overshadowed by bigger problems (which
someone(tm) hopefully will sort out sooner rather than later).

So should this land, I expect someone is going to hoist the trylock at
some point in the future (a sketch of what I mean is at the end of this
mail). If it was my patch I would just do it now, but I understand this
may result in new people showing up and complaining.

> As far as motivation, if I recall correctly, you were present in
> the room when Vlastimil presented the next steps for SLUB at
> LSFMM back in May of last year.
> A link to memory refresher is in the commit log:
> https://lwn.net/Articles/974138/
>
> Back then he talked about a bunch of reasons including better
> maintainability of the kernel overall, but what stood out to me
> as the main reason to use SLUB for bpf, objpool, mempool,
> and networking needs is prevention of memory waste.
> All these wrappers of slub pin memory that should be shared.
> bpf, objpool, mempools should be good citizens of the kernel
> instead of stealing the memory. That's the core job of the
> kernel. To share resources. Memory is one such resource.
>

I suspect the worry is that the added behavior may complicate things
down the road (or even prevent some optimizations) -- there is another
context to worry about.

I think it would help to outline why these are doing any memory
allocation from something like NMI context to begin with. Perhaps you
could have carved out a very small piece of memory as a reserve just for
that? It would be refilled as needed from process context (a rough
sketch is below).

A general remark is that support for an arbitrary running context in
core primitives artificially limits what can be done to optimize them
for their most common users. IMO the sheaves patchset is a little bit of
an admission of that (also see below). It may be that the get pages
routine will get there too.
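To make the reserve suggestion a bit more concrete, below is roughly
what I have in mind. This is untested, all the names are made up, and
real-life details (sizing, boot-time init of the irq_work/work structs,
NUMA, accounting) are ignored -- the point is only that the NMI/any-
context side can get away with a handful of lockless per-CPU slots,
while all the actual allocation happens in process context:

#include <linux/gfp.h>
#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

#define RESERVE_SLOTS	8

struct emergency_reserve {
	struct page *slot[RESERVE_SLOTS];
	struct irq_work kick;		/* NMI-safe trampoline... */
	struct work_struct refill;	/* ...to process context */
};

/* init_irq_work()/INIT_WORK() wiring at boot omitted for brevity */
static DEFINE_PER_CPU(struct emergency_reserve, emergency_reserve);

/*
 * Pop from a context where preemption is already off (NMI, tracing,
 * hard irq): xchg() on the slot is NMI-safe.
 */
static struct page *reserve_pop(void)
{
	struct emergency_reserve *r = this_cpu_ptr(&emergency_reserve);
	int i;

	for (i = 0; i < RESERVE_SLOTS; i++) {
		struct page *page = xchg(&r->slot[i], NULL);

		if (page) {
			/* ask for a refill from a sane context */
			irq_work_queue(&r->kick);
			return page;
		}
	}
	return NULL;	/* reserve exhausted, the caller has to cope */
}

/* irq_work handler: merely bounce to a workqueue where sleeping is fine */
static void reserve_kick(struct irq_work *work)
{
	struct emergency_reserve *r =
		container_of(work, struct emergency_reserve, kick);

	schedule_work(&r->refill);
}

/* Process context: top the reserve back up with plain GFP_KERNEL allocs */
static void reserve_refill(struct work_struct *work)
{
	struct emergency_reserve *r =
		container_of(work, struct emergency_reserve, refill);
	int i;

	for (i = 0; i < RESERVE_SLOTS; i++) {
		struct page *page;

		if (r->slot[i])
			continue;
		page = alloc_page(GFP_KERNEL);
		if (!page)
			break;
		if (cmpxchg(&r->slot[i], NULL, page) != NULL)
			__free_page(page);	/* lost the race, slot refilled */
	}
}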
If non-task memory allocs got kicked to the curb, or at least heavily
limited, then a small allocator just for that purpose would do the
trick, and the two variants would likely be simpler than one thing which
supports everyone. This patchset is a step in the opposite direction,
but it may be that there is a good reason.

To my understanding ebpf can be used to run "real" code to do something
or "merely" collect data. I presume the former case is already running
from a context where it can allocate memory no problem. For the latter I
presume ebpf has conflicting goals:

1. not disrupt the workload under observation (cpu and ram) -- to that
   end small memory usage limits are in place. Otherwise a carelessly
   written aggregation can OOM the box (e.g., say someone wants to know
   which files get opened the most and aggregates on names, while a
   malicious user opens and unlinks autogenerated names, endlessly
   growing the list if you let it).

2. actually log stuff even if resources are scarce. To that end I would
   expect that a small area is pre-allocated and periodically drained.

Which for me further puts a question mark over general allocation from
NMI context.

All that said, the cover letter:

> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.

does not justify why ebpf performs allocations in a manner which
warrants any of this, which I suspect is what Andrew asked about.

I never put any effort into ebpf -- it may be that all of the above
makes excellent sense. But then you need to make a case to the people
maintaining the code.
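PS. To be concrete about hoisting the trylock mentioned earlier: I mean
branching on alloc_flags first, so the regular path never pays for a
failed trylock. Untested, same names as the snippet quoted above:

	if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
		/* opportunistic callers give up instead of spinning */
		if (!spin_trylock_irqsave(&zone->lock, flags))
			return NULL;
	} else {
		/* everyone else takes the lock as before */
		spin_lock_irqsave(&zone->lock, flags);
	}

The extra branch is on alloc_flags, which is already at hand, not on the
contended zone->lock cacheline, so regular callers only touch the lock
once.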