On Tue, Jun 28, 2022 at 03:57:54PM +0200, Christoph Lameter wrote:
> On Mon, 27 Jun 2022, Alexei Starovoitov wrote:
>
> > On Mon, Jun 27, 2022 at 5:17 PM Christoph Lameter <cl@xxxxxxxxx> wrote:
> > >
> > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > >
> > > > Introduce any context BPF specific memory allocator.
> > > >
> > > > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > > > run in unknown context where calling plain kmalloc() might not be safe.
> > > > Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> > > > Refill this cache asynchronously from irq_work.
> > >
> > > GFP_ATOMIC etc is not going to work for you?
> >
> > slab_alloc_node -> slab_alloc -> local_lock_irqsave
> > kprobe -> bpf prog -> slab_alloc_node -> deadlock.
> > In other words, the slow path of the slab allocator takes locks.
>
> That is a relatively new feature due to RT logic support. Without RT this
> would be a simple irq disable.

Not just RT. It's the slow path:

  if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
      unlikely(!object || !slab || !node_match(slab, node))) {
          local_unlock_irqrestore(&s->cpu_slab->lock, ...);

and that's not the only lock in there.
new_slab -> allocate_slab ... alloc_pages grabbing more locks.

> Generally doing slab allocation while debugging slab allocation is not
> something that can work. Can we exempt RT locks/irqsave or slab alloc from
> BPF tracing?

People started doing lock profiling with bpf back in 2017.
People do rcu profiling now and attach bpf progs to all kinds of
low-level kernel internals: page alloc, etc.

> I would assume that other key items of kernel logic will have similar
> issues.

We're _not_ asking for any changes from the mm/slab side.
Things have been working all these years.
We're making them more efficient now by getting rid of the
'let's prealloc everything' approach.

> > Which makes it unsafe to use from tracing bpf progs.
> > That's why we preallocated all elements in bpf maps,
> > so there are no calls to mm or rcu logic.
> > bpf specific allocator cannot use locks at all.
> > try_lock approach could have been used in alloc path,
> > but free path cannot fail with try_lock.
> > Hence the algorithm in this patch is purely lockless.
> > bpf prog can attach to spin_unlock_irqrestore and
> > safely do bpf_mem_alloc.
>
> That is generally safe unless you get into reentrance issues with memory
> allocation.

Right. Generic slab/mm/page_alloc/rcu are not ready for reentrance
and are not safe from NMI either.
That's why we added all kinds of safety mechanisms in the bpf layers.

> Which begs the question:
>
> What happens if I try to use BPF to trace *your* shiny new memory

'Shiny and new' is an overstatement.
It's a trivial lockless freelist layer on top of kmalloc.
Please read the patch.

> allocation functions in the BPF logic like bpf_mem_alloc? How do you stop
> that from happening?

Here is the comment in the patch:

/* notrace is necessary here and in other functions to make sure
 * bpf programs cannot attach to them and cause llist corruptions.
 */
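
To make 'trivial lockless freelist layer' concrete, here is a minimal
sketch of the shape of the idea. This is illustrative only, not the code
in the patch: unit_cache, unit_alloc/unit_free, UNIT_SIZE and the
watermark constants are made-up names, and the actual patch keeps one
such cache per cpu per size bucket and disables irqs around the list pop.

  #include <linux/irq_work.h>
  #include <linux/kernel.h>
  #include <linux/llist.h>
  #include <linux/slab.h>

  #define UNIT_SIZE       96      /* illustrative: one bucket's element size */
  #define UNIT_LOW_WMARK  32      /* refill when the cache drops below this */
  #define UNIT_HIGH_WMARK 96      /* refill up to this many free elements */

  struct unit_cache {
          struct llist_head free_list;    /* free elements; a llist_node
                                           * sits at the start of each one */
          unsigned int free_cnt;          /* approximate fill level */
          struct irq_work refill_work;
  };

  /* Runs in irq_work context, a known context where calling
   * kmalloc(GFP_ATOMIC) and taking slab locks is safe.
   */
  static void notrace cache_refill(struct irq_work *work)
  {
          struct unit_cache *c = container_of(work, struct unit_cache,
                                              refill_work);
          void *obj;

          while (c->free_cnt < UNIT_HIGH_WMARK) {
                  obj = kmalloc(UNIT_SIZE, GFP_ATOMIC);
                  if (!obj)
                          break;
                  llist_add(obj, &c->free_list);
                  c->free_cnt++;
          }
  }

  /* Callable from any context: no locks, only lockless llist ops.
   * Alloc may fail if the cache is empty; the refill is kicked off
   * asynchronously.  (llist_del_first() needs a single consumer per
   * list, which is why the real patch pops with irqs disabled.)
   */
  static void *notrace unit_alloc(struct unit_cache *c)
  {
          struct llist_node *obj = llist_del_first(&c->free_list);

          if (obj)
                  c->free_cnt--;
          if (c->free_cnt < UNIT_LOW_WMARK)
                  irq_work_queue(&c->refill_work);
          return obj;
  }

  /* Free never fails and never takes a lock: just push back on the list.
   * This is why try_lock was not an option for the free path.
   */
  static void notrace unit_free(struct unit_cache *c, void *obj)
  {
          llist_add(obj, &c->free_list);
          c->free_cnt++;
  }

  static void unit_cache_init(struct unit_cache *c)
  {
          init_llist_head(&c->free_list);
          c->free_cnt = 0;
          init_irq_work(&c->refill_work, cache_refill);
          irq_work_queue(&c->refill_work);        /* prime the cache */
  }

The notrace annotations are the point of the quoted comment: bpf progs
cannot attach to these functions, so a tracing prog can never interrupt
and reenter the llist manipulation, and the only path that touches
slab/mm locks, the refill, is deferred to irq_work rather than run in
the caller's unknown context.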