On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
<bigeasy@xxxxxxxxxxxxx> wrote:
>
> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where the
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF-specific freelists to satisfy such allocation requests. Instead,
> > introduce the __GFP_TRYLOCK flag that makes the page allocator
> > accessible from any context. It relies on the percpu free list of
> > pages that rmqueue_pcplist() should be able to pop a page from. If
> > that fails (due to IRQ re-entrancy or the list being empty), then
> > try_alloc_pages() attempts to spin_trylock zone->lock and refill the
> > percpu freelist as usual.
> > A BPF program may execute with IRQs disabled, and zone->lock is a
> > sleeping lock on RT, so trylock is the only option.
>
> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
> where it is not known how much memory will be needed and what the
> calling context is.

Exactly.

> I hope it does not spread across the kernel where
> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they
> learn that this does not work, add this flag to the mix to make it work
> without spending some time on reworking it.

We can call it __GFP_BPF to discourage any other usage, but that seems
like an odd "solution" to a code review problem.

If people start using __GFP_TRYLOCK just to shut up lockdep splats,
they will soon realize that it's an _opportunistic_ allocator. bpf
doesn't need more than a page, and a single page will likely be found
in the percpu free page pool, so this opportunistic approach will work
most of the time for bpf, but it is unlikely to work for others that
need order >= PAGE_ALLOC_COSTLY_ORDER (3).

> Side note: I am in the process of hopefully getting rid of the
> preempt_disable() from trace points. What remains then is attaching BPF
> programs to any code/ function with a raw_spinlock_t and I am not yet
> sure what to do here.

That won't help the bpf core. There are tracepoints that are called
after preemption is disabled. The worst is trace_contention_begin(),
and people have good reasons to attach bpf progs there to collect
contention stats. In that case the bpf prog has no idea what kind of
spin_lock is contending; preemption and/or IRQs may already have been
disabled before that tracepoint fires. So removing preempt_disable()
from the tracepoint handling logic doesn't help the bpf core, though
it's a good thing to do anyway.
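
To make the trylock fallback concrete, here is a rough userspace
analogy of the allocation path from the commit message quoted at the
top. This is not the actual kernel code (that lives in
mm/page_alloc.c); all names below are illustrative. A per-thread array
stands in for the percpu pcplist, and pthread_spin_trylock() stands in
for the opportunistic zone->lock trylock:

/* Userspace sketch of the opportunistic allocator described above. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define REFILL_BATCH	8
#define PAGE_SZ		4096

static __thread struct {
	void *slots[REFILL_BATCH];
	int nr;
} pcp;					/* stand-in for the pcplist */

static pthread_spinlock_t zone_lock;	/* stand-in for zone->lock */

static void *try_alloc_page(void)
{
	/* Fast path: pop from the per-thread cache, no lock taken. */
	if (pcp.nr > 0)
		return pcp.slots[--pcp.nr];

	/*
	 * Slow path: refill under the shared lock, but only trylock it.
	 * The caller may already hold it (IRQ re-entrancy in the kernel
	 * case, a sleeping lock on RT), so spinning could deadlock.
	 */
	if (pthread_spin_trylock(&zone_lock) != 0)
		return NULL;		/* opportunistic: just fail */

	while (pcp.nr < REFILL_BATCH) {
		void *p = malloc(PAGE_SZ);
		if (!p)
			break;
		pcp.slots[pcp.nr++] = p;
	}
	pthread_spin_unlock(&zone_lock);

	return pcp.nr ? pcp.slots[--pcp.nr] : NULL;
}

int main(void)
{
	pthread_spin_init(&zone_lock, PTHREAD_PROCESS_PRIVATE);
	/* Callers must tolerate NULL: the allocator can fail even when
	 * memory is available, simply because the lock was contended. */
	printf("got %p\n", try_alloc_page());
	return 0;
}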
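
And to illustrate why trace_contention_begin() is the worst case, here
is a minimal contention-counting prog of the kind discussed above
(assuming a recent kernel with tp_btf support and libbpf; the handler
body and names are mine, not from the patch set):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Count contention events per lock address. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u64);
	__type(value, u64);
} counts SEC(".maps");

SEC("tp_btf/contention_begin")
int BPF_PROG(on_contention, void *lock, unsigned int flags)
{
	u64 key = (u64)lock, one = 1, *val;

	/*
	 * This prog runs in whatever context the contending lock was
	 * taken in: preemption and/or IRQs may already be off, and the
	 * lock may be a raw_spinlock_t. Any memory the prog needs here
	 * (e.g. a new map element) must come from a non-blocking path,
	 * which is exactly the case the patch addresses.
	 */
	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		__sync_fetch_and_add(val, 1);
	else
		bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";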