On Tue, Jan 14, 2025 at 2:39 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > >
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > >
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option.
> > >
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> >
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +             return NULL;
> >
> > Deals with that, right?
>
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

Hmm. Until you said it I didn't read it as "imply the opposite" :(

The cover letter is pretty clear...

"
- Since spin_trylock() is not safe in RT from hard IRQ and NMI
  disable such usage in lock_trylock and in try_alloc_pages().
"

and the patch 2 commit log is clear too...

"
Since spin_trylock() cannot be used in RT from hard IRQ or NMI
it uses lockless link list...
"

and further in patch 3 commit log...

"
Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
and fail instantly otherwise, since spin_trylock is not safe from IRQ
due to PI issues.
"

I guess I can reword this particular sentence in patch 1 commit log,
but before jumping to an incorrect conclusion please read the whole set.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.
>
> Maybe I worry too much.

Humans will find a way to blame BPF and/or RT for all of their problems
anyway. Just days ago BPF was blamed in RT for causing IPIs during JIT.
Valentin's patches are going to address that but ain't noone has time
to explain that continuously.

Seriously, though, the number of things that still run in hard irq context
in RT is so small that if some tracing BPF prog is attached there
it should be using prealloc mode. Full prealloc is still the default
for bpf hash map.
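
For readers following the thread, the allocation path described in the
quoted changelog text (bail out on RT from hard IRQ or NMI, otherwise pop
from the percpu list, otherwise trylock the zone and refill) can be
modelled in userspace roughly as below. This is only a sketch, not the
patch itself: struct ctx, struct zone_model, struct pcp_model, BATCH and
every function name here are hypothetical stand-ins, and placing the RT
check ahead of the percpu fast path is an assumption taken from the
two-line hunk quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel API. */
struct ctx {
	bool preempt_rt;	/* models IS_ENABLED(CONFIG_PREEMPT_RT) */
	bool in_nmi;		/* models in_nmi() */
	bool in_hardirq;	/* models in_hardirq() */
};

struct zone_model {
	bool locked;		/* models zone->lock being held elsewhere */
	int free_pages;		/* pages still available in the zone */
};

struct pcp_model {
	int count;		/* pages sitting on the percpu free list */
};

enum { BATCH = 4 };		/* assumed refill batch size */

/* Take the lock only if it is free; never wait. */
static bool model_trylock(struct zone_model *z)
{
	if (z->locked)
		return false;
	z->locked = true;
	return true;
}

static void model_unlock(struct zone_model *z)
{
	assert(z->locked);	/* unlocking an unheld lock is a bug */
	z->locked = false;
}

/* Returns true if a page was obtained. Never blocks and never spins. */
static bool try_alloc_page_model(const struct ctx *c, struct zone_model *z,
				 struct pcp_model *p)
{
	int n;

	/* On RT, trylock is not safe from hard IRQ or NMI: fail instantly. */
	if (c->preempt_rt && (c->in_nmi || c->in_hardirq))
		return false;

	/* Fast path: pop a page from the percpu list. */
	if (p->count > 0) {
		p->count--;
		return true;
	}

	/* Slow path: refill only if the zone lock is free right now. */
	if (!model_trylock(z))
		return false;
	n = z->free_pages < BATCH ? z->free_pages : BATCH;
	z->free_pages -= n;
	p->count += n;
	model_unlock(z);

	if (p->count == 0)
		return false;	/* zone was empty as well */
	p->count--;
	return true;
}
```

The point of the model is that every failure path returns immediately, so
a caller (a tracing BPF prog in the real thing) has to treat failure as a
normal outcome, or use prealloc mode where it cannot tolerate it.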