Hi,

On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>
>> When doing stress tests for qp-trie, bpf_mem_alloc() returned NULL
>> unexpectedly because all qp-trie operations were initiated from
>> bpf syscalls and there was still available free memory. bpf_obj_new()
>> has the same problem as shown by the following selftest.
>>
>> The failure is due to preemption. irq_work_raise() will invoke
>> irq_work_claim() first to mark the irq work as pending and then invoke
>> __irq_work_queue_local() to raise an IPI. So when the current task
>> which is invoking irq_work_raise() is preempted by another task,
>> unit_alloc() may return NULL for the preempting task as shown below:
>>
>> task A                         task B
>>
>> unit_alloc()
>>     // low_watermark = 32
>>     // free_cnt = 31 after alloc
>>     irq_work_raise()
>>         // mark irq work as IRQ_WORK_PENDING
>>         irq_work_claim()
>>
>>                                // task B preempts task A
>>                                unit_alloc()
>>                                    // free_cnt = 30 after alloc
>>                                    // irq work is already PENDING,
>>                                    // so just return
>>                                    irq_work_raise()
>>                                // does unit_alloc() 30-times
>>                                ......
>>                                unit_alloc()
>>                                    // free_cnt = 0 before alloc
>>                                    return NULL
>>
>> Fix it by invoking preempt_disable_notrace() before allocation and
>> invoking preempt_enable_notrace() to enable preemption after
>> irq_work_raise() completes. An alternative fix is to move
>> local_irq_restore() after the invocation of irq_work_raise(), but it
>> will enlarge the irq-disabled region. Another feasible fix is to only
>> disable preemption before invoking irq_work_queue() and enable
>> preemption after the invocation in irq_work_raise(), but it can't
>> handle the case when c->low_watermark is 1.
>>
>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>> ---
>>  kernel/bpf/memalloc.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 9c49ae53deaf..83f8913ebb0a 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -6,6 +6,7 @@
>>  #include <linux/irq_work.h>
>>  #include <linux/bpf_mem_alloc.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/preempt.h>
>>  #include <asm/local.h>
>>
>>  /* Any context (including NMI) BPF specific memory allocator.
>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>  	 * Use per-cpu 'active' counter to order free_list access between
>>  	 * unit_alloc/unit_free/bpf_mem_refill.
>>  	 */
>> +	preempt_disable_notrace();
>>  	local_irq_save(flags);
>>  	if (local_inc_return(&c->active) == 1) {
>>  		llnode = __llist_del_first(&c->free_llist);
>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>
>>  	if (cnt < c->low_watermark)
>>  		irq_work_raise(c);
>> +	/* Enable preemption after the enqueue of irq work completes,
>> +	 * so free_llist may be refilled by irq work before other task
>> +	 * preempts current task.
>> +	 */
>> +	preempt_enable_notrace();
> So this helps qp-trie init, since it's doing bpf_mem_alloc from
> syscall context, and helps bpf_obj_new from bpf prog, since prog is
> non-migrateable, but preemptable. It's not an issue for htab doing it
> during map_update, since it's under the htab bucket lock.
> Let's introduce minimal:
>
> /* big comment here explaining the reason of extra preempt disable */
> static void bpf_memalloc_irq_work_raise(...)
> {
>         preempt_disable_notrace();
>         irq_work_raise();
>         preempt_enable_notrace();
> }
>
> it will have the same effect, right?

No.
As I said in the commit message, when c->low_watermark is 1, the above
fix doesn't work, as shown below:

task A                         task B

unit_alloc()
    // low_watermark = 1
    // free_cnt = 0 after alloc
                               // task B preempts task A
                               unit_alloc()
                                   // free_cnt = 0 before alloc
                                   return NULL
irq_work_raise()
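
To make the window concrete, here is a simplified sketch (not the actual
memalloc.c code; pop_free_llist() and the free_cnt field are made-up
stand-ins for the real llist/local counter handling) comparing the
wrapper-only approach with what the patch does:

/* Simplified sketch, not the real kernel code: only the ordering of the
 * allocation and irq_work_raise() matters here.
 */

/* Variant 1: preemption is disabled only inside the wrapper. */
static void *unit_alloc_wrapper_only(struct bpf_mem_cache *c)
{
	void *obj;

	obj = pop_free_llist(c);	/* free_cnt drops to 0 */

	/*
	 * <-- preemption window: with low_watermark == 1, task B can run a
	 * full unit_alloc() here, see free_cnt == 0 and return NULL,
	 * because the irq work has not been raised yet.
	 */

	if (c->free_cnt < c->low_watermark)
		bpf_memalloc_irq_work_raise(c);	/* preempt disabled only here */
	return obj;
}

/* Variant 2 (this patch): preemption is disabled around both steps, so
 * the irq work is already queued before any other task can observe the
 * emptied free_llist.
 */
static void *unit_alloc_patched(struct bpf_mem_cache *c)
{
	void *obj;

	preempt_disable_notrace();
	obj = pop_free_llist(c);
	if (c->free_cnt < c->low_watermark)
		irq_work_raise(c);
	preempt_enable_notrace();	/* refill already queued here */
	return obj;
}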