On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote:
> > On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
> >>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>
> >>>> When doing a stress test for qp-trie, bpf_mem_alloc() returned NULL
> >>>> unexpectedly even though all qp-trie operations were initiated from
> >>>> bpf syscalls and there was still free memory available. bpf_obj_new()
> >>>> has the same problem, as shown by the following selftest.
> >>>>
> >>>> The failure is due to preemption. irq_work_raise() will invoke
> >>>> irq_work_claim() first to mark the irq work as pending and then invoke
> >>>> __irq_work_queue_local() to raise an IPI. So when the current task,
> >>>> which is invoking irq_work_raise(), is preempted by another task,
> >>>> unit_alloc() may return NULL for the preempting task as shown below:
> >>>>
> >>>> task A                  task B
> >>>>
> >>>> unit_alloc()
> >>>>   // low_watermark = 32
> >>>>   // free_cnt = 31 after alloc
> >>>>   irq_work_raise()
> >>>>     // mark irq work as IRQ_WORK_PENDING
> >>>>     irq_work_claim()
> >>>>
> >>>>                         // task B preempts task A
> >>>>                         unit_alloc()
> >>>>                           // free_cnt = 30 after alloc
> >>>>                           // irq work is already PENDING,
> >>>>                           // so just return
> >>>>                           irq_work_raise()
> >>>>                         // does unit_alloc() 30 times
> >>>>                         ......
> >>>>                         unit_alloc()
> >>>>                           // free_cnt = 0 before alloc
> >>>>                           return NULL
> >>>>
> >>>> Fix it by invoking preempt_disable_notrace() before the allocation and
> >>>> invoking preempt_enable_notrace() to re-enable preemption after
> >>>> irq_work_raise() completes. An alternative fix is to move
> >>>> local_irq_restore() after the invocation of irq_work_raise(), but it
> >>>> would enlarge the irq-disabled region. Another feasible fix is to only
> >>>> disable preemption before invoking irq_work_queue() and enable
> >>>> preemption after the invocation in irq_work_raise(), but it can't
> >>>> handle the case when c->low_watermark is 1.
> >>>>
> >>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >>>> ---
> >>>>  kernel/bpf/memalloc.c | 8 ++++++++
> >>>>  1 file changed, 8 insertions(+)
> >>>>
> >>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >>>> index 9c49ae53deaf..83f8913ebb0a 100644
> >>>> --- a/kernel/bpf/memalloc.c
> >>>> +++ b/kernel/bpf/memalloc.c
> >>>> @@ -6,6 +6,7 @@
> >>>>  #include <linux/irq_work.h>
> >>>>  #include <linux/bpf_mem_alloc.h>
> >>>>  #include <linux/memcontrol.h>
> >>>> +#include <linux/preempt.h>
> >>>>  #include <asm/local.h>
> >>>>
> >>>>  /* Any context (including NMI) BPF specific memory allocator.
> >>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> >>>>          * Use per-cpu 'active' counter to order free_list access between
> >>>>          * unit_alloc/unit_free/bpf_mem_refill.
> >>>>          */
> >>>> +       preempt_disable_notrace();
> >>>>         local_irq_save(flags);
> >>>>         if (local_inc_return(&c->active) == 1) {
> >>>>                 llnode = __llist_del_first(&c->free_llist);
> >>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> >>>>
> >>>>         if (cnt < c->low_watermark)
> >>>>                 irq_work_raise(c);
> >>>> +       /* Enable preemption after the enqueue of irq work completes,
> >>>> +        * so free_llist may be refilled by irq work before other task
> >>>> +        * preempts current task.
> >>>> +        */
> >>>> +       preempt_enable_notrace();
> >>>
> >>> So this helps qp-trie init, since it's doing bpf_mem_alloc from
> >>> syscall context, and it helps bpf_obj_new from a bpf prog, since the
> >>> prog is non-migratable but preemptable. It's not an issue for htab
> >>> during map_update, since that is under the htab bucket lock.
> >>>
> >>> Let's introduce a minimal:
> >>>
> >>> /* big comment here explaining the reason of extra preempt disable */
> >>> static void bpf_memalloc_irq_work_raise(...)
> >>> {
> >>>         preempt_disable_notrace();
> >>>         irq_work_raise();
> >>>         preempt_enable_notrace();
> >>> }
> >>>
> >>> It will have the same effect, right?
> >>
> >> No. As I said in the commit message, when c->low_watermark is 1, the
> >> above fix doesn't work, as shown below:
> >
> > Yes. I got the mark=1 part. I just don't think it's worth the complexity.
>
> I just found out that for bpf_obj_new() the minimal low_watermark is 2
> instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
> low_watermark as 2, the above fix may not work when there is nested
> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
> (free_cnt = 0 after alloc) -> preempted by task C (which fails to
> allocate). And in my naive understanding of the bpf memory allocator,
> these fixes are simple. Why do you think they will introduce extra
> complexity? Do you mean preempt_disable_notrace() could be used to
> trigger the running of a bpf program? If that is the problem, I think
> we should fix it instead.

I'm not worried about recursive calls from _notrace(). That shouldn't
be possible.

I'm just saying that disabling preemption around irq_work_raise() helps
a bit, while disabling it around the whole unit_alloc/free is snake oil.
A bpf prog could be running in an irq-disabled context, where a
preempt-disabled unit_alloc vs irq_work_raise won't make any difference.
Both will return NULL. Same with a batched htab update. It will hit
NULL too.

So from my pov you're trying to fix something that is not fixable.
Batched alloc will fail. The users have to use bpf_ma differently.
In places where they cannot afford alloc failures they need to
pre-allocate. We were planning to introduce a bpf_obj_new() that does
kmalloc when the bpf prog is sleepable. Then a bpf prog can stash such
a manually pre-allocated object and use it later.
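
For reference, a rough sketch of the minimal helper Alexei suggests above,
assuming it would sit in kernel/bpf/memalloc.c next to the existing
irq_work_raise(struct bpf_mem_cache *c); the name bpf_memalloc_irq_work_raise
comes from his sketch and is not merged code:

/* Sketch only. Keep the current task on this CPU until the refill
 * irq_work has really been queued. A task preempted between
 * irq_work_claim() and __irq_work_queue_local() has marked the work
 * PENDING without queueing it, so irq_work_queue() calls made by the
 * preempting task return early and no refill runs until the first
 * task gets to run again.
 */
static void bpf_memalloc_irq_work_raise(struct bpf_mem_cache *c)
{
        preempt_disable_notrace();
        irq_work_raise(c);
        preempt_enable_notrace();
}

Callers in unit_alloc()/unit_free() would invoke this wrapper instead of
irq_work_raise() directly. As discussed above, the narrower preempt-disabled
window still cannot cover the low_watermark = 1 / nested-preemption case,
which is the point of disagreement in this thread.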
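
The "pre-allocate and stash" idea in the last paragraph can be illustrated
with the local-kptr machinery that already exists in the selftests (compare
progs/local_kptr_stash.c). A minimal sketch follows: the map, struct, and
program names are made up for illustration, the bpf_obj_new()/bpf_obj_drop()
wrappers come from the selftests' bpf_experimental.h, and the "kmalloc when
the prog is sleepable" behaviour of bpf_obj_new() is only a plan at the time
of this thread:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"  /* bpf_obj_new()/bpf_obj_drop() wrappers */

/* Hypothetical object the program wants to have ready ahead of time. */
struct prealloc_node {
        long key;
        long data;
};

struct stash_value {
        struct prealloc_node __kptr *node;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct stash_value);
} stash SEC(".maps");

/* Allocate in a context where a NULL from bpf_obj_new() is tolerable and
 * park the object in the map. A later program that cannot afford an
 * allocation failure takes the object back with another bpf_kptr_xchg()
 * instead of allocating.
 */
SEC("tc")
int prealloc_and_stash(void *ctx)
{
        struct prealloc_node *n;
        struct stash_value *v;
        int key = 0;

        v = bpf_map_lookup_elem(&stash, &key);
        if (!v)
                return 0;

        n = bpf_obj_new(typeof(*n));
        if (!n)
                return 1;       /* caller can simply retry later */

        n = bpf_kptr_xchg(&v->node, n);
        if (n)                  /* displaced a previously stashed object */
                bpf_obj_drop(n);
        return 0;
}

char _license[] SEC("license") = "GPL";

In the plan Alexei describes, the allocation step would instead live in a
sleepable program where bpf_obj_new() could fall back to kmalloc, so filling
the stash would not depend on the per-cpu free_llist at all.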