On Tue, Jan 14, 2025 at 2:31 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 13-01-25 18:19:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where the
> > running context is unknown, but they need to request additional
> > memory. Prior workarounds used pre-allocated memory and BPF-specific
> > freelists to satisfy such allocation requests.
> > Instead, introduce a gfpflags_allow_spinning() condition that signals
> > to the allocator that the running context is unknown.
> > Then rely on the percpu free list of pages to allocate a page.
> > rmqueue_pcplist() should be able to pop a page from it.
> > If that fails (due to IRQ re-entrancy or the list being empty), then
> > try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill the percpu freelist as normal.
> > A BPF program may execute with IRQs disabled, and zone->lock is a
> > sleeping lock on RT, so trylock is the only option. In theory we could
> > introduce a percpu reentrance counter and increment it every time
> > spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
> > on it. Even if this CPU is not in the page_alloc path,
> > spin_lock_irqsave() is not safe, since the BPF prog might be called
> > from a tracepoint where preemption is disabled. So trylock only.
> >
> > Note: free_page and memcg are not taught about the
> > gfpflags_allow_spinning() condition. That support comes in the next
> > patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> LGTM, I am not entirely clear on the kmsan_alloc_page part though.

Which part is still confusing? I hoped the comment below was enough:

 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
 * is safe in any context. Also zeroing the page is mandatory for
 * BPF use cases.

and once you zoom into:

void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags)
{
	bool initialized = (flags & __GFP_ZERO) || !kmsan_enabled;
	...
	if (initialized) {
		__memset(page_address(shadow), 0, PAGE_SIZE * pages);
		__memset(page_address(origin), 0, PAGE_SIZE * pages);
		return;
	}

So it's safe to call it, and it's necessary to call it when KMSAN is on.
This was the easiest code path to analyze from a doesn't-take-spinlocks
point of view. I feel the comment is enough.
If/when people want to support the !__GFP_ZERO case with KMSAN, they
would need to make stack_depot_save() behave under the
!gfpflags_allow_spinning() condition.
Since __GFP_ZERO is necessary for the BPF use case, I left all the extra
work for future follow-ups.

> As long as that part is correct you can add
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
>
> Other than that try_alloc_pages_noprof begs some user documentation.
>
> /**
>  * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
>  * @nid - node to allocate from
>  * @order - allocation order size
>  *
>  * Allocates pages of a given order from the given node. This is safe to
>  * call from any context (atomic, NMI, but also reentrant
>  * allocator -> tracepoint -> try_alloc_pages_noprof).
>  * Allocation is best effort and expected to fail easily, so nobody should
>  * rely on its success. Failures are not reported via warn_alloc().
>  *
>  * Return: allocated page or NULL on failure.
>  */

Nicely worded. Will add.

Thanks for all the reviews. Appreciate it!
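
P.S. For anyone following the thread, here is a minimal sketch of the
trylock-only idea from the commit message. It is not the upstream
rmqueue()/try_alloc_pages() code: refill_pcplist_sketch() is a made-up
name for illustration, while gfpflags_allow_spinning(),
spin_trylock_irqsave(), spin_lock_irqsave() and spin_unlock_irqrestore()
are the real kernel helpers referenced above.

/* Sketch only, not the upstream implementation. */
static bool refill_pcplist_sketch(struct zone *zone, gfp_t gfp_flags)
{
	unsigned long flags;

	if (!gfpflags_allow_spinning(gfp_flags)) {
		/*
		 * Unknown context (e.g. BPF prog on a tracepoint): IRQs may
		 * be disabled and zone->lock is a sleeping lock on RT, so
		 * only a trylock is safe and the caller must accept failure.
		 */
		if (!spin_trylock_irqsave(&zone->lock, flags))
			return false;
	} else {
		/* Normal contexts may spin on zone->lock as usual. */
		spin_lock_irqsave(&zone->lock, flags);
	}

	/* ... move pages from the buddy lists onto the percpu free list ... */

	spin_unlock_irqrestore(&zone->lock, flags);
	return true;
}

On the trylock-failure path the allocation simply fails, matching the
best-effort "NULL on failure" contract in the kernel-doc above.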