On 12/12/24 03:14, Alexei Starovoitov wrote: > On Wed, Dec 11, 2024 at 12:39 AM Vlastimil Babka <vbabka@xxxxxxx> wrote: >> >> On 12/10/24 22:53, Alexei Starovoitov wrote: >> > On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior >> > <bigeasy@xxxxxxxxxxxxx> wrote: >> >> >> >> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote: >> >> > From: Alexei Starovoitov <ast@xxxxxxxxxx> >> >> > >> >> > Tracing BPF programs execute from tracepoints and kprobes where running >> >> > context is unknown, but they need to request additional memory. >> >> > The prior workarounds were using pre-allocated memory and BPF specific >> >> > freelists to satisfy such allocation requests. Instead, introduce >> >> > __GFP_TRYLOCK flag that makes page allocator accessible from any context. >> >> > It relies on percpu free list of pages that rmqueue_pcplist() should be >> >> > able to pop the page from. If it fails (due to IRQ re-entrancy or list >> >> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock >> >> > and refill percpu freelist as normal. >> >> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT, >> >> > so trylock is the only option. >> >> >> >> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF >> >> where it is not known how much memory will be needed and what the >> >> calling context is. >> > >> > Exactly. >> > >> >> I hope it does not spread across the kernel where >> >> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they >> >> learn that this does not work, add this flag to the mix to make it work >> >> without spending some time on reworking it. >> > >> > We can call it __GFP_BPF to discourage any other usage, >> > but that seems like an odd "solution" to code review problem. >> >> Could we perhaps not expose the flag to public headers at all, and keep it >> only as an internal detail of try_alloc_pages_noprof()? > > public headers? I mean it could be (with some work) defined only in e.g. mm/internal.h, which the flag printing code would then need to include. > To pass additional bit via gfp flags into alloc_pages > gfp_types.h has to be touched. Ah right, try_alloc_pages_noprof() would need to move to page_alloc.c instead of being static inline in the header. > If you mean moving try_alloc_pages() into mm/page_alloc.c and > adding another argument to __alloc_pages_noprof then it's not pretty. > It has 'gfp_t gfp' argument. It should to be used to pass the intent. __GFP_TRYLOCK could be visible in page_alloc.c to do this, but not ouside mm code. > We don't have to add GFP_TRYLOCK at all if we go with > memalloc_nolock_save() approach. I have doubts about that idea. We recently rejected PF_MEMALLOC_NORECLAIM because it could lead to allocations nested in that scope failing and they might not expect it. Scoped trylock would have even higher chance of failing. I think here we need to pass the flag as part of gfp flags only within nested allocations (for metadata or debugging) within the slab/page allocator itself, which we already mostly do. The harder problem is not missing any place where it should affect taking a lock, and a PF_ flag won't help with that (as we can't want all locking functions to look at it). Maybe it could help with lockdep helping us find locks that we missed, but I'm sure lockdep could be made to track the trylock scope even without a PF flag? > So I started looking at it, > but immediately hit trouble with bits. > There are 5 bits left in PF_ and 3 already used for mm needs. > That doesn't look sustainable long term. > How about we alias nolock concept with PF_MEMALLOC_PIN ? > > As far as I could trace PF_MEMALLOC_PIN clears GFP_MOVABLE and nothing else. > > The same bit plus lack of __GFP_KSWAPD_RECLAIM in gfp flags > would mean nolock mode in alloc_pages, > while PF_MEMALLOC_PIN alone would mean nolock in free_pages > and deeper inside memcg paths and such. > > thoughts? too hacky?