On Tue, Dec 10, 2024 at 10:39 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 12/10/24 03:39, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where running
> > context is unknown, but they need to request additional memory.
> > The prior workarounds were using pre-allocated memory and BPF specific
> > freelists to satisfy such allocation requests. Instead, introduce
> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> > It relies on percpu free list of pages that rmqueue_pcplist() should be
> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> > so trylock is the only option.
> > In theory we can introduce percpu reentrance counter and increment it
> > every time spin_lock_irqsave(&zone->lock, flags) is used,
> > but we cannot rely on it. Even if this cpu is not in page_alloc path
> > the spin_lock_irqsave() is not safe, since BPF prog might be called
> > from tracepoint where preemption is disabled. So trylock only.
> >
> > Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
> > The support comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> I think there might be more non-try spin_locks reachable from page allocations:
>
> - in reserve_highatomic_pageblock() which I think is reachable unless this
>   is limited to order-0

Good point. I missed this bit:

	if (order > 0)
		alloc_flags |= ALLOC_HIGHATOMIC;

In the bpf use case it will be called with order == 0 only,
but it's better to fool-proof it. I will switch to:

__GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT

> - try_to_accept_memory_one()

When I studied the code it looked to me that there should be
no unaccepted_pages. I think you're saying that there could be
unaccepted memory left over from a previous allocation and the trylock
attempt just got unlucky enough to reach that path?

What do you think of the following:

-		cond_accept_memory(zone, order);
+		cond_accept_memory(zone, order, alloc_flags);

 		/*
 		 * Detect whether the number of free pages is below high
@@ -7024,7 +7024,8 @@ static inline bool has_unaccepted_memory(void)
 	return static_branch_unlikely(&zones_with_unaccepted_pages);
 }

-static bool cond_accept_memory(struct zone *zone, unsigned int order)
+static bool cond_accept_memory(struct zone *zone, unsigned int order,
+			       unsigned int alloc_flags)
 {
 	long to_accept;
 	bool ret = false;
@@ -7032,6 +7033,9 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order)
 	if (!has_unaccepted_memory())
 		return false;

+	if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+		return false;
+

or is there a better approach? Reading from current->flags
the way Matthew proposed?

> - as part of post_alloc_hook() in set_page_owner(), stack depot might do
>   raw_spin_lock_irqsave(), is that one ok?

Well, I looked at the stack depot and was tempted to add
trylock handling there, but it looked to be a bit dodgy in general
and I figured it should be done separately from this set.
Like:

	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
		page = alloc_pages(gfp_nested_mask(alloc_flags),

followed by:

	if (in_nmi()) {
		/* We can never allocate in NMI context. */
		WARN_ON_ONCE(can_alloc);

That warn is too late. If we were in_nmi() and already called alloc_pages(),
the kernel might be misbehaving already.
(A small illustration of the kind of early check I have in mind is
appended at the end of this mail.)

>
> hope I didn't miss anything else especially in those other debugging hooks
> (KASAN etc)

I looked through them and could be missing something, of course.
kasan usage in the alloc_page path seems fine,
but for slab I found the kasan_quarantine logic, which needs
special treatment. Other slab debugging bits pose issues too.
The rough idea is to do kmalloc_nolock() / kfree_nolock() that
don't call into any pre/post hooks (including slab_free_hook,
slab_pre_alloc_hook). kmalloc_nolock() will pretty much call
__slab_alloc_node() directly and do only the basic kasan poison
stuff that needs no locks. (A rough sketch of that is appended
below as well.)

I will be going over all the paths again, of course.
Thanks for the reviews so far!
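
To spell out the stack depot point above: the snippet below is only a
sketch, not a patch against lib/stackdepot.c, and depot_may_alloc() is a
made-up helper name. The idea is simply that "may we allocate here at all?"
gets answered before alloc_pages() is reached, instead of a WARN after the
fact:

	/*
	 * Sketch only: decide up front whether allocation is allowed,
	 * instead of calling alloc_pages() and then WARNing in the
	 * in_nmi() branch. depot_may_alloc() is a made-up name.
	 */
	static bool depot_may_alloc(gfp_t alloc_flags, bool can_alloc)
	{
		/* Never allocate from NMI context. */
		if (in_nmi())
			return false;

		/* Callers from unknown context opt out via the new flag. */
		if (alloc_flags & __GFP_TRYLOCK)
			return false;

		return can_alloc;
	}

With something like that, can_alloc is settled once on entry and the later
in_nmi() WARN_ON_ONCE(can_alloc) becomes unnecessary.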
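
And a very rough sketch of the kmalloc_nolock() direction; this is not code
from the series, it would have to live in mm/slub.c next to
__slab_alloc_node(), and the signatures of the internal helpers
(kmalloc_slab(), __slab_alloc_node()) are from memory, so they may not match
any particular tree exactly:

	/*
	 * Sketch only: allocate without taking any non-try locks by
	 * skipping slab_pre_alloc_hook()/slab_post_alloc_hook() (memcg
	 * charging, kasan quarantine, etc.) and going straight to the
	 * slab allocation path. Only lock-free kasan poisoning would be
	 * applied to the result.
	 */
	void *kmalloc_nolock(size_t size, gfp_t flags)
	{
		struct kmem_cache *s;

		/* keep the sketch simple: no zero-size or large allocs */
		if (unlikely(!size || size > KMALLOC_MAX_CACHE_SIZE))
			return NULL;

		/* size -> kmalloc cache lookup, as __do_kmalloc_node() does */
		s = kmalloc_slab(size, NULL, flags, _RET_IP_);

		return __slab_alloc_node(s, flags | __GFP_TRYLOCK | __GFP_NOWARN,
					 NUMA_NO_NODE, _RET_IP_, size);
	}

kfree_nolock() would mirror this by not calling slab_free_hook().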