Hi,

On 8/23/2023 9:57 AM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>
>>>> When doing a stress test for qp-trie, bpf_mem_alloc() returned NULL
>>>> unexpectedly because all qp-trie operations were initiated from
>>>> bpf syscalls and there was still free memory available. bpf_obj_new()
>>>> has the same problem as shown by the following selftest.
>>>>
>>>> The failure is due to preemption. irq_work_raise() will invoke
>>>> irq_work_claim() first to mark the irq work as pending and then invoke
>>>> __irq_work_queue_local() to raise an IPI. So when the current task
>>>> which is invoking irq_work_raise() is preempted by another task,
>>>> unit_alloc() may return NULL for the preempting task as shown below:
>>>>
>>>> task A                              task B
>>>>
>>>> unit_alloc()
>>>>   // low_watermark = 32
>>>>   // free_cnt = 31 after alloc
>>>>   irq_work_raise()
>>>>     // mark irq work as IRQ_WORK_PENDING
>>>>     irq_work_claim()
>>>>
>>>>                                     // task B preempts task A
>>>>                                     unit_alloc()
>>>>                                       // free_cnt = 30 after alloc
>>>>                                       // irq work is already PENDING,
>>>>                                       // so just return
>>>>                                       irq_work_raise()
>>>>                                     // does unit_alloc() 30 times
>>>>                                     ......
>>>>                                     unit_alloc()
>>>>                                       // free_cnt = 0 before alloc
>>>>                                       return NULL
>>>>
>>>> Fix it by invoking preempt_disable_notrace() before allocation and
>>>> invoking preempt_enable_notrace() to enable preemption after
>>>> irq_work_raise() completes. An alternative fix is to move
>>>> local_irq_restore() after the invocation of irq_work_raise(), but it
>>>> will enlarge the irq-disabled region. Another feasible fix is to only
>>>> disable preemption before invoking irq_work_queue() and enable
>>>> preemption after the invocation in irq_work_raise(), but it can't
>>>> handle the case when c->low_watermark is 1.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>> ---
>>>>  kernel/bpf/memalloc.c | 8 ++++++++
>>>>  1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 9c49ae53deaf..83f8913ebb0a 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -6,6 +6,7 @@
>>>>  #include <linux/irq_work.h>
>>>>  #include <linux/bpf_mem_alloc.h>
>>>>  #include <linux/memcontrol.h>
>>>> +#include <linux/preempt.h>
>>>>  #include <asm/local.h>
>>>>
>>>>  /* Any context (including NMI) BPF specific memory allocator.
>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>           * Use per-cpu 'active' counter to order free_list access between
>>>>           * unit_alloc/unit_free/bpf_mem_refill.
>>>>           */
>>>> +        preempt_disable_notrace();
>>>>          local_irq_save(flags);
>>>>          if (local_inc_return(&c->active) == 1) {
>>>>                  llnode = __llist_del_first(&c->free_llist);
>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>
>>>>          if (cnt < c->low_watermark)
>>>>                  irq_work_raise(c);
>>>> +        /* Enable preemption after the enqueue of irq work completes,
>>>> +         * so free_llist may be refilled by irq work before other task
>>>> +         * preempts current task.
>>>> +         */
>>>> +        preempt_enable_notrace();
>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from
>>> syscall context, and helps bpf_obj_new from a bpf prog, since the prog
>>> is non-migratable but preemptable. It's not an issue for htab during
>>> map_update, since it's under the htab bucket lock.
>>> Let's introduce minimal:
>>>
>>> /* big comment here explaining the reason of extra preempt disable */
>>> static void bpf_memalloc_irq_work_raise(...)
>>> {
>>>         preempt_disable_notrace();
>>>         irq_work_raise();
>>>         preempt_enable_notrace();
>>> }
>>>
>>> it will have the same effect, right?
>>
>> No. As I said in the commit message, when c->low_watermark is 1, the
>> above fix doesn't work as shown below:
> Yes. I got the mark=1 part. I just don't think it's worth the complexity.

Just found out that for bpf_obj_new() the minimal low_watermark is 2
instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
low_watermark as 2, the above fix may not work when there is nested
preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
(free_cnt = 0 after alloc) -> preempted by task C (fails to allocate).
A rough sketch of this case follows at the end of this mail. And in my
naive understanding of the bpf memory allocator, these fixes are simple.
Why do you think it will introduce extra complexity? Do you mean
preempt_disable_notrace() could be used to trigger the running of a bpf
program? If that is the problem, I think we should fix it instead.
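
Here is the rough sketch of the nested preemption case, in the same style
as the diagram in the commit message. It assumes low_watermark = 2 and that
both task A and task B are preempted in the window between the allocation
and the call to the proposed bpf_memalloc_irq_work_raise() wrapper, so the
irq work is never raised before the free list is drained:

task A                        task B                        task C

unit_alloc()
  // low_watermark = 2
  // free_cnt = 1 after alloc
  // preempted before reaching
  // the irq_work_raise() wrapper
                              unit_alloc()
                                // free_cnt = 0 after alloc
                                // also preempted before
                                // reaching the wrapper
                                                            unit_alloc()
                                                              // free_cnt = 0 before alloc
                                                              return NULL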