On Mon, Aug 29, 2022 at 2:59 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 8/26/22 4:44 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> >
> > Front-end kmalloc() with minimal per-cpu cache of free elements.
> > Refill this cache asynchronously from irq_work.
> >
> > BPF programs always run with migration disabled.
> > It's safe to allocate from cache of the current cpu with irqs disabled.
> > Free-ing is always done into bucket of the current cpu as well.
> > irq_work trims extra free elements from buckets with kfree
> > and refills them with kmalloc, so global kmalloc logic takes care
> > of freeing objects allocated by one cpu and freed on another.
> >
> > struct bpf_mem_alloc supports two modes:
> > - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
> >   This is typical bpf hash map use case when all elements have equal size.
> > - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
> >   kmalloc/kfree. Max allocation size is 4096 in this case.
> >   This is bpf_dynptr and bpf_kptr use case.
> >
> > bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
> > bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
> >
> > The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
> >
> > Acked-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/bpf_mem_alloc.h |  26 ++
> >  kernel/bpf/Makefile           |   2 +-
> >  kernel/bpf/memalloc.c         | 476 ++++++++++++++++++++++++++++++++++
> >  3 files changed, 503 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/bpf_mem_alloc.h
> >  create mode 100644 kernel/bpf/memalloc.c
> >
> [...]
> > +#define NUM_CACHES 11
> > +
> > +struct bpf_mem_cache {
> > +	/* per-cpu list of free objects of size 'unit_size'.
> > +	 * All accesses are done with interrupts disabled and 'active' counter
> > +	 * protection with __llist_add() and __llist_del_first().
> > +	 */
> > +	struct llist_head free_llist;
> > +	local_t active;
> > +
> > +	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
> > +	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
> > +	 * fail. When 'active' is busy the unit_free() will add an object to
> > +	 * free_llist_extra.
> > +	 */
> > +	struct llist_head free_llist_extra;
> > +
> > +	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
> > +	 * element size.
> > +	 */
> > +	struct kmem_cache *kmem_cache;
> > +	struct irq_work refill_work;
> > +	struct obj_cgroup *objcg;
> > +	int unit_size;
> > +	/* count of objects in free_llist */
> > +	int free_cnt;
> > +};
> > +
> > +struct bpf_mem_caches {
> > +	struct bpf_mem_cache cache[NUM_CACHES];
> > +};
> > +
>
> Could we now also completely get rid of the current map prealloc infra (pcpu_freelist*
> I mean), and replace it with above variant altogether? Would be nice to make it work
> for this case, too, and then get rid of percpu_freelist.{h,c} .. it's essentially a
> superset wrt functionality iiuc?

Eventually it would be possible to get rid of prealloc logic completely,
but not so fast. LRU map needs to be converted first.
Then a lot of production testing is necessary to gain confidence
and make sure we didn't miss any corner cases.
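
To make the sequencing described by the comments in the quoted struct concrete,
here is a simplified sketch of the alloc/free fast path. It is illustrative only,
not the code from the patch: LOW_WATERMARK, HIGH_WATERMARK and irq_work_raise()
stand in for the refill/flush plumbing done in bpf_mem_refill(), and the sketch
passes struct llist_node pointers directly instead of hiding the node inside
each object.

/* Illustrative sketch only -- not the exact code from the patch.
 * It shows how the per-cpu 'active' counter and the irq-disabled
 * section sequence the fast path on struct bpf_mem_cache above.
 * LOW_WATERMARK, HIGH_WATERMARK and irq_work_raise() are assumed
 * names for the refill/flush plumbing done from irq_work.
 */
#include <linux/irqflags.h>
#include <linux/irq_work.h>
#include <linux/llist.h>
#include <asm/local.h>

#define LOW_WATERMARK	32	/* refill free_llist below this count */
#define HIGH_WATERMARK	96	/* flush free_llist above this count */

static void irq_work_raise(struct bpf_mem_cache *c)
{
	/* Schedule c->refill_work; the irq_work callback then does the
	 * kmalloc()/kfree() in a context where that is safe.
	 */
	irq_work_queue(&c->refill_work);
}

/* Alloc fast path. Runs with migration disabled, possibly from NMI. */
static void *unit_alloc(struct bpf_mem_cache *c)
{
	struct llist_node *llnode = NULL;
	unsigned long flags;
	int cnt = 0;

	/* irqs off keeps interrupts on this cpu out; the 'active' counter
	 * additionally detects reentrancy from NMI/kprobe programs, which
	 * local_irq_save() alone cannot exclude.
	 */
	local_irq_save(flags);
	if (local_inc_return(&c->active) == 1) {
		llnode = __llist_del_first(&c->free_llist);
		if (llnode)
			cnt = --c->free_cnt;
	}
	local_dec(&c->active);
	local_irq_restore(flags);

	if (cnt < LOW_WATERMARK)
		irq_work_raise(c);
	return llnode;	/* NULL when the cache is empty or busy */
}

/* Free fast path. Cannot fail: when the llist is busy (e.g. an NMI
 * program interrupted unit_alloc() on this cpu), fall back to the
 * lockless free_llist_extra list, drained later from irq_work.
 */
static void unit_free(struct bpf_mem_cache *c, struct llist_node *llnode)
{
	unsigned long flags;
	int cnt = 0;

	local_irq_save(flags);
	if (local_inc_return(&c->active) == 1) {
		__llist_add(llnode, &c->free_llist);
		cnt = ++c->free_cnt;
	} else {
		llist_add(llnode, &c->free_llist_extra);
	}
	local_dec(&c->active);
	local_irq_restore(flags);

	if (cnt > HIGH_WATERMARK)
		irq_work_raise(c);
}

The design point this sketch tries to show is that disabling irqs cannot
exclude an NMI-attached bpf program on the same cpu, so the 'active' counter
turns the llist operations into a kind of try-lock: unit_alloc() may simply
return NULL when the cache is busy, while unit_free(), which must not fail,
falls back to the lockless free_llist_extra list.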