Hi,

On 8/25/2023 11:48 AM, Alexei Starovoitov wrote:
> On Thu, Aug 24, 2023 at 7:07 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 8/24/2023 12:33 AM, Alexei Starovoitov wrote:
>>> On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote:
>>>>> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
>>>>>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>>>>
>>>>>>>> When doing a stress test for qp-trie, bpf_mem_alloc() returned NULL
>>>>>>>> unexpectedly because all qp-trie operations were initiated from
>>>>>>>> bpf syscalls and there was still free memory available. bpf_obj_new()
>>>>>>>> has the same problem, as shown by the following selftest.
>>>>>>>>
>>>>>>>> The failure is due to preemption. irq_work_raise() will invoke
>>>>>>>> irq_work_claim() first to mark the irq work as pending and then invoke
>>>>>>>> __irq_work_queue_local() to raise an IPI. So when the current task
>>>>>>>> which is invoking irq_work_raise() is preempted by another task,
>>>>>>>> unit_alloc() may return NULL for the preempting task as shown below:
>>>>>>>>
>>>>>>>> task A                      task B
>>>>>>>>
>>>>>>>> unit_alloc()
>>>>>>>>   // low_watermark = 32
>>>>>>>>   // free_cnt = 31 after alloc
>>>>>>>>   irq_work_raise()
>>>>>>>>     // mark irq work as IRQ_WORK_PENDING
>>>>>>>>     irq_work_claim()
>>>>>>>>
>>>>>>>>                             // task B preempts task A
>>>>>>>>                             unit_alloc()
>>>>>>>>                               // free_cnt = 30 after alloc
>>>>>>>>                               // irq work is already PENDING,
>>>>>>>>                               // so just return
>>>>>>>>                               irq_work_raise()
>>>>>>>>                             // does unit_alloc() 30 times
>>>>>>>>                             ......
>>>>>>>>                             unit_alloc()
>>>>>>>>                               // free_cnt = 0 before alloc
>>>>>>>>                               return NULL
>>>>>>>>
>>>>>>>> Fix it by invoking preempt_disable_notrace() before allocation and
>>>>>>>> invoking preempt_enable_notrace() to enable preemption after
>>>>>>>> irq_work_raise() completes. An alternative fix is to move
>>>>>>>> local_irq_restore() after the invocation of irq_work_raise(), but it
>>>>>>>> will enlarge the irq-disabled region. Another feasible fix is to only
>>>>>>>> disable preemption before invoking irq_work_queue() and enable
>>>>>>>> preemption after the invocation in irq_work_raise(), but it can't
>>>>>>>> handle the case when c->low_watermark is 1.
>>>>>>>>
>>>>>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>  kernel/bpf/memalloc.c | 8 ++++++++
>>>>>>>>  1 file changed, 8 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>>>>>> index 9c49ae53deaf..83f8913ebb0a 100644
>>>>>>>> --- a/kernel/bpf/memalloc.c
>>>>>>>> +++ b/kernel/bpf/memalloc.c
>>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>>  #include <linux/irq_work.h>
>>>>>>>>  #include <linux/bpf_mem_alloc.h>
>>>>>>>>  #include <linux/memcontrol.h>
>>>>>>>> +#include <linux/preempt.h>
>>>>>>>>  #include <asm/local.h>
>>>>>>>>
>>>>>>>>  /* Any context (including NMI) BPF specific memory allocator.
>>>>>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>>>>>          * Use per-cpu 'active' counter to order free_list access between
>>>>>>>>          * unit_alloc/unit_free/bpf_mem_refill.
>>>>>>>>          */
>>>>>>>> +       preempt_disable_notrace();
>>>>>>>>         local_irq_save(flags);
>>>>>>>>         if (local_inc_return(&c->active) == 1) {
>>>>>>>>                 llnode = __llist_del_first(&c->free_llist);
>>>>>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>>>>>
>>>>>>>>         if (cnt < c->low_watermark)
>>>>>>>>                 irq_work_raise(c);
>>>>>>>> +       /* Enable preemption after the enqueue of the irq work completes,
>>>>>>>> +        * so free_llist may be refilled by the irq work before another
>>>>>>>> +        * task preempts the current task.
>>>>>>>> +        */
>>>>>>>> +       preempt_enable_notrace();
>>>>>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from
>>>>>>> syscall context, and helps bpf_obj_new from a bpf prog, since the prog
>>>>>>> is non-migratable, but preemptable. It's not an issue for htab during
>>>>>>> map_update, since it's under the htab bucket lock.
>>>>>>> Let's introduce a minimal:
>>>>>>> /* big comment here explaining the reason for the extra preempt disable */
>>>>>>> static void bpf_memalloc_irq_work_raise(...)
>>>>>>> {
>>>>>>>         preempt_disable_notrace();
>>>>>>>         irq_work_raise();
>>>>>>>         preempt_enable_notrace();
>>>>>>> }
>>>>>>>
>>>>>>> It will have the same effect, right?
>>>>>>> .
>>>>>> No. As I said in the commit message, when c->low_watermark is 1, the
>>>>>> above fix doesn't work as shown below:
>>>>> Yes. I got the mark=1 part. I just don't think it's worth the complexity.
>>>> Just found out that for bpf_obj_new() the minimal low_watermark is 2
>>>> instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
>>>> low_watermark as 2, the above fix may not work when there is nested
>>>> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
>>>> (free_cnt = 0 after alloc) -> preempted by task C (fails to do the
>>>> allocation). And in my naive understanding of the bpf memory allocator,
>>>> these fixes are simple. Why do you think they will introduce extra
>>>> complexity? Do you mean preempt_disable_notrace() could be used to
>>>> trigger the running of a bpf program? If that is the problem, I think
>>>> we should fix it instead.
>>> I'm not worried about recursive calls from _notrace(). That shouldn't
>>> be possible.
>> OK
>>> I'm just saying that disabling preemption around irq_work_raise() helps
>>> a bit, while disabling it around the whole unit_alloc/free is snake oil.
>>> A bpf prog could be running in an irq-disabled context, and then
>>> preempt-disabled unit_alloc vs irq_work_raise won't make any difference.
>>> Both will return NULL.
>>> Same with batched htab update. It will hit NULL too.
>>> So from my pov you're trying to fix something that is not fixable.
>> The patch set didn't try to fix the problem for all possible contexts,
>> especially the irq-disabled context. It just tries to fix the ENOMEM
>> problem for process context, which is the major context. I still think
>> disabling preemption around the whole unit_alloc/free is much more solid
>> than only doing that for irq_work_raise() (e.g., for the nested
>> preemption case). But if you have a strong preference for only disabling
>> preemption for irq_work_raise(), I will post v2 to do that.
> In process ctx the preempt_disable/enable across unit_alloc will keep
> asking the kernel to consider preemption on every unit_alloc call,
> which can be a lot.
> If I'm reading the code correctly, preempt_schedule() is quite heavy.
> Doing it on every unit_alloc is a performance concern.
>
> Could you try the following:
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9c49ae53deaf..ee8262f58c5a 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work)
>
>  static void notrace irq_work_raise(struct bpf_mem_cache *c)
>  {
> -       irq_work_queue(&c->refill_work);
> +       if (!irq_work_queue(&c->refill_work)) {
> +               preempt_disable_notrace();
> +               preempt_enable_notrace();
> +       }
>  }
>
> The idea is that it will ask for a resched if preemptible.
> Will it address the issue you're seeing?
>
> .

No. It didn't work. If you are concerned about the overhead of
preempt_enable_notrace(), we could use local_irq_save() and
local_irq_restore() instead.
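
To be concrete, what I have in mind is the alternative already mentioned in
the commit message: keep the local_irq_save() which unit_alloc() already
does and only invoke local_irq_restore() after irq_work_raise() has run, so
there is no preempt_schedule() on every unit_alloc() call. A rough, untested
sketch only to show the ordering (the body of unit_alloc() is elided and
approximated from the current kernel/bpf/memalloc.c):

  static void notrace *unit_alloc(struct bpf_mem_cache *c)
  {
          struct llist_node *llnode = NULL;
          unsigned long flags;
          int cnt = 0;

          local_irq_save(flags);
          if (local_inc_return(&c->active) == 1) {
                  llnode = __llist_del_first(&c->free_llist);
                  if (llnode)
                          cnt = --c->free_cnt;
          }
          local_dec(&c->active);

          WARN_ON(cnt < 0);

          if (cnt < c->low_watermark)
                  irq_work_raise(c);
          /* Restore IRQs only after the irq work has been raised, so a
           * task which preempts the current task right here will find
           * the refill already pending instead of skipping it.
           */
          local_irq_restore(flags);

          return llnode;
  }

The downside is the slightly larger irq-disabled region, but unlike
preempt_enable_notrace() it doesn't ask for a reschedule on every
unit_alloc() call.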