From: Hou Tao <houtao1@xxxxxxxxxx>

When running stress tests for qp-trie, bpf_mem_alloc() returned NULL
unexpectedly because all qp-trie operations were initiated from bpf
syscalls and free memory was still available. bpf_obj_new() has the
same problem, as shown by the following selftest.

The failure is due to preemption. irq_work_raise() will invoke
irq_work_claim() first to mark the irq work as pending and then invoke
__irq_work_queue_local() to raise an IPI. So when the current task
which is invoking irq_work_raise() is preempted by another task,
unit_alloc() may return NULL for the preempting task as shown below:

task A                 task B

unit_alloc()
  // low_watermark = 32
  // free_cnt = 31 after alloc
  irq_work_raise()
    // mark irq work as IRQ_WORK_PENDING
    irq_work_claim()

                       // task B preempts task A
                       unit_alloc()
                         // free_cnt = 30 after alloc
                         irq_work_raise()
                           // irq work is already PENDING,
                           // so just return
                       // do unit_alloc() 30 times
                       ......
                       unit_alloc()
                         // free_cnt = 0 before alloc
                         return NULL

Fix it by invoking preempt_disable_notrace() before the allocation and
invoking preempt_enable_notrace() to re-enable preemption after
irq_work_raise() completes. An alternative fix is to move
local_irq_restore() after the invocation of irq_work_raise(), but it
would enlarge the irq-disabled region. Another feasible fix is to
disable preemption only around the invocation of irq_work_queue()
inside irq_work_raise(), but it can't handle the case when
c->low_watermark is 1.

Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
---
 kernel/bpf/memalloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9c49ae53deaf..83f8913ebb0a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -6,6 +6,7 @@
 #include <linux/irq_work.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
+#include <linux/preempt.h>
 #include <asm/local.h>
 
 /* Any context (including NMI) BPF specific memory allocator.
@@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	 * Use per-cpu 'active' counter to order free_list access between
 	 * unit_alloc/unit_free/bpf_mem_refill.
 	 */
+	preempt_disable_notrace();
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		llnode = __llist_del_first(&c->free_llist);
@@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 
 	if (cnt < c->low_watermark)
 		irq_work_raise(c);
+	/* Enable preemption after the enqueue of irq work completes,
+	 * so free_llist may be refilled by irq work before other task
+	 * preempts current task.
+	 */
+	preempt_enable_notrace();
+
 	return llnode;
 }
 
--
2.29.2
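
To make the diagram above concrete, here is a hypothetical userspace
sketch in plain C. It is not kernel code and not part of the patch:
unit_alloc_sim, struct cache and LOW_WATERMARK are illustrative names.
It only models the free_cnt bookkeeping and the "irq work pending but
not yet run" state, and shows how a preempting consumer can drain the
cache to zero before the deferred refill runs.

/* Hypothetical userspace model of the scenario in the diagram above.
 * It only tracks free_cnt and the "irq work already pending" flag,
 * assuming the deferred refill never runs because the task that
 * raised it was preempted.
 */
#include <stdbool.h>
#include <stdio.h>

#define LOW_WATERMARK 32        /* matches "low_watermark = 32" above */

struct cache {
        int free_cnt;           /* objects left in the free list */
        bool irq_work_pending;  /* IRQ_WORK_PENDING already claimed */
};

/* Mirrors only the bookkeeping of unit_alloc(): take one object and
 * mark the (simulated) irq work pending when below the watermark.
 */
static bool unit_alloc_sim(struct cache *c)
{
        if (c->free_cnt == 0)
                return false;   /* the real unit_alloc() returns NULL here */
        c->free_cnt--;
        if (c->free_cnt < LOW_WATERMARK)
                c->irq_work_pending = true;     /* refill queued, not run */
        return true;
}

int main(void)
{
        struct cache c = { .free_cnt = 32, .irq_work_pending = false };
        bool ok;
        int i;

        /* task A: one allocation, claims the irq work, then is
         * preempted before the refill can run.
         */
        unit_alloc_sim(&c);

        /* task B: keeps allocating; every call sees the work as
         * pending, so nothing refills the cache and it runs dry
         * after 31 allocations.
         */
        for (i = 0; i < 31; i++)
                unit_alloc_sim(&c);

        ok = unit_alloc_sim(&c);
        printf("free_cnt = %d, next alloc %s\n", c.free_cnt,
               ok ? "succeeds" : "returns NULL");
        return 0;
}

Built with any C compiler, it prints "free_cnt = 0, next alloc returns
NULL", which is the failure in the diagram. The
preempt_disable_notrace()/preempt_enable_notrace() pair in the patch
closes this window: task A is not preempted until the irq work has been
queued, so the refill can run before another task starts consuming
free_llist.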