On Tue, Jun 28, 2022 at 03:57:54PM +0200, Christoph Lameter wrote:
> On Mon, 27 Jun 2022, Alexei Starovoitov wrote:
>
> > On Mon, Jun 27, 2022 at 5:17 PM Christoph Lameter <cl@xxxxxxxxx> wrote:
> > >
> > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > >
> > > > Introduce any context BPF specific memory allocator.
> > > >
> > > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > > run in unknown context where calling plain kmalloc() might not be safe.
> > > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > > Refill this cache asynchronously from irq_work.
> > >
> > > GFP_ATOMIC etc is not going to work for you?
> >
> > slab_alloc_node -> slab_alloc -> local_lock_irqsave
> > kprobe -> bpf prog -> slab_alloc_node -> deadlock.
> > In other words, the slow path of the slab allocator takes locks.
>
> That is a relatively new feature due to RT logic support. Without RT this
> would be a simple irq disable.

Not just RT. It's the slow path:

  if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
      unlikely(!object || !slab || !node_match(slab, node))) {
          local_unlock_irqrestore(&s->cpu_slab->lock, ...);

and that's not the only lock in there.
new_slab -> allocate_slab ... alloc_pages grabbing more locks.

> Generally doing slab allocation while debugging slab allocation is not
> something that can work. Can we exempt RT locks/irqsave or slab alloc from
> BPF tracing?

People started doing lock profiling with bpf back in 2017.
People do rcu profiling now and attach bpf progs to all kinds of
low-level kernel internals: page alloc, etc.

> I would assume that other key items of kernel logic will have similar
> issues.

We're _not_ asking for any changes from the mm/slab side.
Things have been working all these years.
We're making them more efficient now by getting rid of the
'let's prealloc everything' approach.

> > Which makes it unsafe to use from tracing bpf progs.
> > That's why we preallocated all elements in bpf maps,
> > so there are no calls to mm or rcu logic.
> > bpf specific allocator cannot use locks at all.
> > try_lock approach could have been used in alloc path,
> > but free path cannot fail with try_lock.
> > Hence the algorithm in this patch is purely lockless.
> > bpf prog can attach to spin_unlock_irqrestore and
> > safely do bpf_mem_alloc.
>
> That is generally safe unless you get into reentrance issues with memory
> allocation.

Right. Generic slab/mm/page_alloc/rcu are not ready for reentrance
and are not safe from NMI either.
That's why we added all kinds of safety mechanisms in the bpf layers.

> Which begs the question:
>
> What happens if I try to use BPF to trace *your* shiny new memory

'Shiny and new' is an overstatement.
It's a trivial lockless freelist layer on top of kmalloc.
Please read the patch.

> allocation functions in the BPF logic like bpf_mem_alloc? How do you stop
> that from happening?

Here is the comment in the patch:

/* notrace is necessary here and in other functions to make sure
 * bpf programs cannot attach to them and cause llist corruptions.
 */
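
To make 'trivial lockless freelist layer' concrete, here is a minimal
sketch of the shape of the idea. This is illustrative only, not the code
in the patch: unit_cache, unit_alloc/unit_free, UNIT_SIZE and the
watermark constants are made-up names, and the actual patch keeps one
such cache per cpu per size bucket and disables irqs around the list pop.

  #include <linux/irq_work.h>
  #include <linux/kernel.h>
  #include <linux/llist.h>
  #include <linux/slab.h>

  #define UNIT_SIZE       96      /* illustrative: one bucket's element size */
  #define UNIT_LOW_WMARK  32      /* refill when the cache drops below this */
  #define UNIT_HIGH_WMARK 96      /* refill up to this many free elements */

  struct unit_cache {
          struct llist_head free_list;    /* free elements; a llist_node
                                           * sits at the start of each one */
          unsigned int free_cnt;          /* approximate fill level */
          struct irq_work refill_work;
  };

  /* Runs in irq_work context, a known context where calling
   * kmalloc(GFP_ATOMIC) and taking slab locks is safe.
   */
  static void notrace cache_refill(struct irq_work *work)
  {
          struct unit_cache *c = container_of(work, struct unit_cache,
                                              refill_work);
          void *obj;

          while (c->free_cnt < UNIT_HIGH_WMARK) {
                  obj = kmalloc(UNIT_SIZE, GFP_ATOMIC);
                  if (!obj)
                          break;
                  llist_add(obj, &c->free_list);
                  c->free_cnt++;
          }
  }

  /* Callable from any context: no locks, only lockless llist ops.
   * Alloc may fail if the cache is empty; the refill is kicked off
   * asynchronously.  (llist_del_first() needs a single consumer per
   * list, which is why the real patch pops with irqs disabled.)
   */
  static void *notrace unit_alloc(struct unit_cache *c)
  {
          struct llist_node *obj = llist_del_first(&c->free_list);

          if (obj)
                  c->free_cnt--;
          if (c->free_cnt < UNIT_LOW_WMARK)
                  irq_work_queue(&c->refill_work);
          return obj;
  }

  /* Free never fails and never takes a lock: just push back on the list.
   * This is why try_lock was not an option for the free path.
   */
  static void notrace unit_free(struct unit_cache *c, void *obj)
  {
          llist_add(obj, &c->free_list);
          c->free_cnt++;
  }

  static void unit_cache_init(struct unit_cache *c)
  {
          init_llist_head(&c->free_list);
          c->free_cnt = 0;
          init_irq_work(&c->refill_work, cache_refill);
          irq_work_queue(&c->refill_work);        /* prime the cache */
  }

The notrace annotations are the point of the quoted comment: bpf progs
cannot attach to these functions, so a tracing prog can never interrupt
and reenter the llist manipulation, and the only path that touches
slab/mm locks, the refill, is deferred to irq_work rather than run in
the caller's unknown context.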