On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> When doing a stress test for qp-trie, bpf_mem_alloc() returned NULL
> unexpectedly because all qp-trie operations were initiated from
> bpf syscalls and there was still free memory available. bpf_obj_new()
> has the same problem, as shown by the following selftest.
>
> The failure is due to preemption. irq_work_raise() will invoke
> irq_work_claim() first to mark the irq work as pending and then invoke
> __irq_work_queue_local() to raise an IPI. So when the current task
> which is invoking irq_work_raise() is preempted by another task,
> unit_alloc() may return NULL for the preempting task as shown below:
>
> task A                          task B
>
> unit_alloc()
>   // low_watermark = 32
>   // free_cnt = 31 after alloc
>   irq_work_raise()
>     // mark irq work as IRQ_WORK_PENDING
>     irq_work_claim()
>
>                                 // task B preempts task A
>                                 unit_alloc()
>                                   // free_cnt = 30 after alloc
>                                   // irq work is already PENDING,
>                                   // so just return
>                                   irq_work_raise()
>                                 // does unit_alloc() 30 times
>                                 ......
>                                 unit_alloc()
>                                   // free_cnt = 0 before alloc
>                                   return NULL
>
> Fix it by invoking preempt_disable_notrace() before the allocation and
> invoking preempt_enable_notrace() to re-enable preemption after
> irq_work_raise() completes. An alternative fix is to move
> local_irq_restore() after the invocation of irq_work_raise(), but it
> would enlarge the irq-disabled region. Another feasible fix is to only
> disable preemption before invoking irq_work_queue() and enable
> preemption after the invocation in irq_work_raise(), but it can't
> handle the case when c->low_watermark is 1.
>
> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> ---
>  kernel/bpf/memalloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9c49ae53deaf..83f8913ebb0a 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -6,6 +6,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/bpf_mem_alloc.h>
>  #include <linux/memcontrol.h>
> +#include <linux/preempt.h>
>  #include <asm/local.h>
>
>  /* Any context (including NMI) BPF specific memory allocator.
> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>  	 * Use per-cpu 'active' counter to order free_list access between
>  	 * unit_alloc/unit_free/bpf_mem_refill.
>  	 */
> +	preempt_disable_notrace();
>  	local_irq_save(flags);
>  	if (local_inc_return(&c->active) == 1) {
>  		llnode = __llist_del_first(&c->free_llist);
> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>
>  	if (cnt < c->low_watermark)
>  		irq_work_raise(c);
> +	/* Enable preemption after the enqueue of irq work completes,
> +	 * so free_llist may be refilled by irq work before other task
> +	 * preempts current task.
> +	 */
> +	preempt_enable_notrace();

So this helps qp-trie init, since it's doing bpf_mem_alloc from syscall
context, and it helps bpf_obj_new from a bpf prog, since the prog is
non-migratable but preemptable.
It's not an issue for htab during map_update, since that runs under the
htab bucket lock.

Let's introduce a minimal:

/* big comment here explaining the reason for the extra preempt disable */
static void bpf_memalloc_irq_work_raise(...)
{
	preempt_disable_notrace();
	irq_work_raise();
	preempt_enable_notrace();
}

It will have the same effect, right?
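
For concreteness, here is a minimal sketch of that wrapper as I read the
proposal. The struct bpf_mem_cache * argument and the switched call site
in unit_alloc() are my assumptions (the snippet above leaves the signature
as "..."); irq_work_raise(c) is the existing helper in kernel/bpf/memalloc.c:

/* irq_work_claim() only marks the refill work as pending; if the current
 * task is preempted after the claim but before irq_work_queue() actually
 * raises the IPI, the preempting task sees IRQ_WORK_PENDING, skips raising
 * the IPI itself and can drain free_llist to zero before the refill runs.
 * Keep the claim and the queue in one non-preemptible region so the refill
 * irq work is guaranteed to be raised before another task can allocate on
 * this CPU.
 */
static void bpf_memalloc_irq_work_raise(struct bpf_mem_cache *c)
{
	preempt_disable_notrace();
	irq_work_raise(c);
	preempt_enable_notrace();
}

and the call site in unit_alloc() would become:

	if (cnt < c->low_watermark)
		bpf_memalloc_irq_work_raise(c);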