Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation

Sebastian Sewior <bigeasy@xxxxxxxxxxxxx> · Thu, 12 Dec 2024 16:07:44 +0100

On 2024-12-10 14:06:32 [-0800], Alexei Starovoitov wrote:
> > Is there any reason why GFP_ATOMIC cannot be extended to support new
> > contexts? This allocation mode is already documented to be usable from
> > atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> > extend the current implementation to use only trylock on zone->lock if
> > called from in_nmi() to reduce unexpected failures on contention for
> > existing users?
> 
> No. in_nmi() doesn't help. It's the lack of reentrance of slab and page
> allocator that is an issue.
> The page allocator might grab zone lock. In !RT it will disable irqs.
> In RT will stay sleepable. Both paths will be calling other
> kernel code including tracepoints, potential kprobes, etc
> and bpf prog may be attached somewhere.
> If it calls alloc_page() it may deadlock on zone->lock.
> pcpu lock is thankfully trylock already.
> So !irqs_disabled() part of preemptible() guarantees that
> zone->lock won't deadlock in !RT.
> And rcu_preempt_depth() case just steers bpf into try lock only path in RT.
> Since there is no way to tell whether it's safe to call
> sleepable spin_lock(&zone->lock).

Oh. You don't need to check rcu_preempt_depth() for that. On PREEMPT_RT
rcu_preempt_depth() is incremented with every spin_lock() because we
need an explicit start of an RCU section (same thing happens with
preempt_disable() spin_lock()). If there is already an RCU section
(rcu_preempt_depth() > 0) you can still try to acquire a spinlock_t and
maybe schedule out/sleep. That is okay.
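
To illustrate, here is a rough userspace toy model, not kernel code. All
toy_* names are made up for this sketch, and it assumes only what is said
above: an RT spin_lock() takes the lock and then opens an RCU read
section, so the depth counter is non-zero inside any spinlocked region.

```c
#include <stdbool.h>

/* Stands in for rcu_preempt_depth(); hypothetical, for illustration only. */
static int toy_rcu_depth;

struct toy_spinlock {
	bool held;
};

/* Models an RT spin_lock(): take the lock, then start an RCU section. */
static void toy_rt_spin_lock(struct toy_spinlock *l)
{
	l->held = true;		/* the real lock may sleep here on contention */
	toy_rcu_depth++;	/* models the explicit rcu_read_lock() */
}

/* Models an RT spin_unlock(): end the RCU section, then drop the lock. */
static void toy_rt_spin_unlock(struct toy_spinlock *l)
{
	toy_rcu_depth--;
	l->held = false;
}

/* Returns the depth observed while two locks are held, one inside the other. */
static int toy_depth_with_nested_locks(void)
{
	struct toy_spinlock a = { false }, b = { false };
	int depth;

	toy_rt_spin_lock(&a);	/* depth becomes 1 */
	toy_rt_spin_lock(&b);	/* taken inside an RCU section, still fine */
	depth = toy_rcu_depth;
	toy_rt_spin_unlock(&b);
	toy_rt_spin_unlock(&a);
	return depth;
}
```

In this model a depth > 0 only says that some lock (or an RCU section) is
held. It says nothing about whether taking another spinlock_t is safe.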

But since I see in_nmi(): you can't trylock from NMI on RT. The trylock
part is easy but the unlock might need to acquire
rt_mutex_base::wait_lock and, in the worst case, wake a waiter via
wake_up_process().
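
A simplified userspace sketch of why the unlock side is the problem. This
is a toy model with C11 atomics, not the rt_mutex code. The toy_* names,
the "waiters" bit in the owner word and the counters are assumptions made
for illustration, and only the uncontended trylock is modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HAS_WAITERS ((uintptr_t)1)	/* low bit of owner: waiters queued */

struct toy_rtlock {
	atomic_uintptr_t owner;	/* owner id (low bit clear) or 0 when free */
	int wait_lock_taken;	/* counts modelled wait_lock acquisitions */
	int wakeups;		/* counts modelled wake_up_process() calls */
};

/* Trylock: a single compare-and-exchange, nothing else to acquire. */
static bool toy_trylock(struct toy_rtlock *l, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong(&l->owner, &expected, me);
}

/* Another task blocked on the lock while we held it. */
static void toy_mark_waiter(struct toy_rtlock *l)
{
	atomic_fetch_or(&l->owner, TOY_HAS_WAITERS);
}

/* Unlock: returns true if the slow path (wait_lock + wakeup) was needed. */
static bool toy_unlock(struct toy_rtlock *l, uintptr_t me)
{
	uintptr_t expected = me;

	if (atomic_compare_exchange_strong(&l->owner, &expected, 0))
		return false;		/* no waiters: fast path */

	l->wait_lock_taken++;		/* models taking wait_lock */
	atomic_store(&l->owner, 0);
	l->wakeups++;			/* models waking the top waiter */
	return true;
}

/* Runs trylock + unlock once and reports whether unlock hit the slow path. */
static bool toy_unlock_needs_slow_path(bool contended)
{
	struct toy_rtlock l;
	const uintptr_t me = 2;		/* any non-zero id with the low bit clear */

	atomic_init(&l.owner, 0);
	l.wait_lock_taken = 0;
	l.wakeups = 0;

	if (!toy_trylock(&l, me))
		return false;		/* cannot happen on a fresh lock */
	if (contended)
		toy_mark_waiter(&l);
	return toy_unlock(&l, me);
}
```

So even if the trylock itself succeeded without touching anything else,
a waiter that shows up afterwards forces the unlock into the path that
takes another lock and does a wakeup, which is what you can't do from NMI.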

Sebastian