Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation

Michal Hocko <mhocko@xxxxxxxx> · Wed, 11 Dec 2024 11:19:12 +0100

On Tue 10-12-24 14:06:32, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 1:05 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> > > On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> > > > +   if (preemptible() && !rcu_preempt_depth())
> > > > +           return alloc_pages_node_noprof(nid,
> > > > +                                          GFP_NOWAIT | __GFP_ZERO,
> > > > +                                          order);
> > > > +   return alloc_pages_node_noprof(nid,
> > > > +                                  __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> > > > +                                  order);
> > >
> > > [...]
> > >
> > > > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
> > > >      * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
> > > >      */
> > > >     alloc_flags |= (__force int)
> > > > -           (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> > > > +           (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
> > >
> > > It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> > > I was originally wondering if this wasn't a memalloc_nolock_save() /
> > > memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> > > but I wonder if we can simply do:
> > >
> > >       if (!preemptible() || rcu_preempt_depth())
> > >               alloc_flags |= ALLOC_TRYLOCK;
> >
> > preemptible is unusable without CONFIG_PREEMPT_COUNT but I do agree that
> > __GFP_TRYLOCK is not really a preferred way to go forward. For 3
> > reasons.
> >
> > First I do not really like the name as it tells what it does rather than
> > how it should be used. This is a general pattern of many gfp flags
> > unfortunately and historically it has turned out error prone. If a gfp
> > flag is really needed then something like __GFP_ANY_CONTEXT should be
> > used.  Whether the current implementation requires try_lock for
> > zone->lock or other changes is an implementation detail, but the user
> > should have a clear understanding that allocation is allowed from any
> > context (NMI, IRQ or otherwise atomic contexts).
> 
> __GFP_ANY_CONTEXT would make sense if we wanted to make it available
> for all kernel users. In this case I agree with Sebastian.
> This is bpf specific feature, since it doesn't know the context.
> All other kernel users should pick GFP_KERNEL or ATOMIC or NOWAIT.
> Exposing GFP_ANY_CONTEXT to all may lead to sloppy code in drivers
> and elsewhere.

I do not think we want a special allocation mode for a single user. Not
only is there no way to enforce that this remains a BPF-only feature, it
is also not really a good idea to have a single-user feature in the
allocator.

> > Is there any reason why GFP_ATOMIC cannot be extended to support new
> > contexts? This allocation mode is already documented to be usable from
> > atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> > extend the current implementation to use only trylock on zone->lock if
> > called from in_nmi() to reduce unexpected failures on contention for
> > existing users?
> 
> No. in_nmi() doesn't help. It's the lack of reentrance of slab and page
> allocator that is an issue.
> The page allocator might grab zone lock. In !RT it will disable irqs.
> In RT will stay sleepable. Both paths will be calling other
> kernel code including tracepoints, potential kprobes, etc
> and bpf prog may be attached somewhere.
> If it calls alloc_page() it may deadlock on zone->lock.
> pcpu lock is thankfully trylock already.
> So !irqs_disabled() part of preemptible() guarantees that
> zone->lock won't deadlock in !RT.
> And rcu_preempt_depth() case just steers bpf into try lock only path in RT.
> Since there is no way to tell whether it's safe to call
> sleepable spin_lock(&zone->lock).

OK I see!
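
To check that I read this right, here is a rough userspace sketch of the
reentrancy problem. It is not kernel code: a pthread mutex stands in for
zone->lock, and model_alloc_page(), hook_under_lock and model_bpf_prog()
are names made up purely for illustration. The hook models a tracepoint
or kprobe site that is reached while the lock is held and has a bpf prog
attached.

```c
/* Userspace model only: a pthread mutex stands in for zone->lock. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static char page_pool[4][64];	/* tiny fake "free list" */
static int pages_free = 4;

/* Hypothetical hook: code that may run while the lock is held. */
static void (*hook_under_lock)(void);

/*
 * trylock_only == false models a caller that may wait on the lock.
 * trylock_only == true models the "any context" caller: it must never
 * wait, so it fails the allocation if the lock is already held.
 */
static void *model_alloc_page(bool trylock_only)
{
	void *page = NULL;

	if (trylock_only) {
		if (pthread_mutex_trylock(&zone_lock) != 0)
			return NULL;	/* contended or reentered: give up */
	} else {
		pthread_mutex_lock(&zone_lock);
	}

	/* An attached program may run here, with the lock held. */
	if (hook_under_lock)
		hook_under_lock();

	if (pages_free > 0)
		page = page_pool[--pages_free];

	pthread_mutex_unlock(&zone_lock);
	return page;
}

static void *nested_result;
static int nested_calls;

/*
 * Models a bpf prog that allocates from inside the allocator.
 * With trylock_only == false this call would wait forever on a lock
 * that its own thread already holds.
 */
static void model_bpf_prog(void)
{
	nested_calls++;
	nested_result = model_alloc_page(true);
}
```

With hook_under_lock set to model_bpf_prog, an outer
model_alloc_page(false) still gets its page, while the nested call fails
with NULL instead of deadlocking. If I understand the series correctly,
that failure is the price of being callable from any context.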

> > We
> > already have a precedent in form of __alloc_pages_bulk which is a special
> > case allocator mode living outside of the page allocator path. It seems
> > that it covers most of your requirements except the fallback to the
> > regular allocation path AFAICS. Is this something you could piggy back
> > on?
> 
> __alloc_pages_bulk() has all the same issues. It takes locks.
> Also it doesn't support GFP_ACCOUNT which is a show stopper.
> All bpf allocations are going through memcg.

OK, this requirement was not clear to me until I reached the later
patches in the series (just now).
-- 
Michal Hocko
SUSE Labs