On Tue 10-12-24 14:06:32, Alexei Starovoitov wrote: > On Tue, Dec 10, 2024 at 1:05 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > > On Tue 10-12-24 05:31:30, Matthew Wilcox wrote: > > > On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote: > > > > + if (preemptible() && !rcu_preempt_depth()) > > > > + return alloc_pages_node_noprof(nid, > > > > + GFP_NOWAIT | __GFP_ZERO, > > > > + order); > > > > + return alloc_pages_node_noprof(nid, > > > > + __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO, > > > > + order); > > > > > > [...] > > > > > > > @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order) > > > > * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH). > > > > */ > > > > alloc_flags |= (__force int) > > > > - (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM)); > > > > + (gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK)); > > > > > > It's not quite clear to me that we need __GFP_TRYLOCK to implement this. > > > I was originally wondering if this wasn't a memalloc_nolock_save() / > > > memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore), > > > but I wonder if we can simply do: > > > > > > if (!preemptible() || rcu_preempt_depth()) > > > alloc_flags |= ALLOC_TRYLOCK; > > > > preemptible is unusable without CONFIG_PREEMPT_COUNT but I do agree that > > __GFP_TRYLOCK is not really a preferred way to go forward. For 3 > > reasons. > > > > First I do not really like the name as it tells what it does rather than > > how it should be used. This is a general pattern of many gfp flags > > unfotrunatelly and historically it has turned out error prone. If a gfp > > flag is really needed then something like __GFP_ANY_CONTEXT should be > > used. If the current implementation requires to use try_lock for > > zone->lock or other changes is not an implementation detail but the user > > should have a clear understanding that allocation is allowed from any > > context (NMI, IRQ or otherwise atomic contexts). > > __GFP_ANY_CONTEXT would make sense if we wanted to make it available > for all kernel users. In this case I agree with Sebastian. > This is bpf specific feature, since it doesn't know the context. > All other kernel users should pick GFP_KERNEL or ATOMIC or NOWAIT. > Exposing GFP_ANY_CONTEXT to all may lead to sloppy code in drivers > and elsewhere. I do not think we want a single user special allocation mode. Not only there is no way to enforce this to remain BPF special feature, it is also not really a good idea to have a single user feature in the allocator. > > Is there any reason why GFP_ATOMIC cannot be extended to support new > > contexts? This allocation mode is already documented to be usable from > > atomic contexts except from NMI and raw_spinlocks. But is it feasible to > > extend the current implementation to use only trylock on zone->lock if > > called from in_nmi() to reduce unexpected failures on contention for > > existing users? > > No. in_nmi() doesn't help. It's the lack of reentrance of slab and page > allocator that is an issue. > The page alloctor might grab zone lock. In !RT it will disable irqs. > In RT will stay sleepable. Both paths will be calling other > kernel code including tracepoints, potential kprobes, etc > and bpf prog may be attached somewhere. > If it calls alloc_page() it may deadlock on zone->lock. > pcpu lock is thankfully trylock already. > So !irqs_disabled() part of preemptible() guarantees that > zone->lock won't deadlock in !RT. > And rcu_preempt_depth() case just steers bpf into try lock only path in RT. > Since there is no way to tell whether it's safe to call > sleepable spin_lock(&zone->lock). OK I see! > > We > > already have a precence in form of __alloc_pages_bulk which is a special > > case allocator mode living outside of the page allocator path. It seems > > that it covers most of your requirements except the fallback to the > > regular allocation path AFAICS. Is this something you could piggy back > > on? > > __alloc_pages_bulk() has all the same issues. It takes locks. > Also it doesn't support GFP_ACCOUNT which is a show stopper. > All bpf allocations are going through memcg. OK, this requirement was not clear until I've reached later patches in the series (now). -- Michal Hocko SUSE Labs