Re: [PATCH bpf-next 1/3] bpf: Enable preemption after irq_work_raise() in unit_alloc()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
> > On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>
> >> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL
> >> unexpectedly because all qp-trie operations were initiated from
> >> bpf syscalls and there was still available free memory. bpf_obj_new()
> >> has the same problem as shown by the following selftest.
> >>
> >> The failure is due to the preemption. irq_work_raise() will invoke
> >> irq_work_claim() first to mark the irq work as pending and then inovke
> >> __irq_work_queue_local() to raise an IPI. So when the current task
> >> which is invoking irq_work_raise() is preempted by other task,
> >> unit_alloc() may return NULL for preemptive task as shown below:
> >>
> >> task A         task B
> >>
> >> unit_alloc()
> >>   // low_watermark = 32
> >>   // free_cnt = 31 after alloc
> >>   irq_work_raise()
> >>     // mark irq work as IRQ_WORK_PENDING
> >>     irq_work_claim()
> >>
> >>                // task B preempts task A
> >>                unit_alloc()
> >>                  // free_cnt = 30 after alloc
> >>                  // irq work is already PENDING,
> >>                  // so just return
> >>                  irq_work_raise()
> >>                // does unit_alloc() 30-times
> >>                ......
> >>                unit_alloc()
> >>                  // free_cnt = 0 before alloc
> >>                  return NULL
> >>
> >> Fix it by invoking preempt_disable_notrace() before allocation and
> >> invoking preempt_enable_notrace() to enable preemption after
> >> irq_work_raise() completes. An alternative fix is to move
> >> local_irq_restore() after the invocation of irq_work_raise(), but it
> >> will enlarge the irq-disabled region. Another feasible fix is to only
> >> disable preemption before invoking irq_work_queue() and enable
> >> preemption after the invocation in irq_work_raise(), but it can't
> >> handle the case when c->low_watermark is 1.
> >>
> >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >> ---
> >>  kernel/bpf/memalloc.c | 8 ++++++++
> >>  1 file changed, 8 insertions(+)
> >>
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index 9c49ae53deaf..83f8913ebb0a 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -6,6 +6,7 @@
> >>  #include <linux/irq_work.h>
> >>  #include <linux/bpf_mem_alloc.h>
> >>  #include <linux/memcontrol.h>
> >> +#include <linux/preempt.h>
> >>  #include <asm/local.h>
> >>
> >>  /* Any context (including NMI) BPF specific memory allocator.
> >> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> >>          * Use per-cpu 'active' counter to order free_list access between
> >>          * unit_alloc/unit_free/bpf_mem_refill.
> >>          */
> >> +       preempt_disable_notrace();
> >>         local_irq_save(flags);
> >>         if (local_inc_return(&c->active) == 1) {
> >>                 llnode = __llist_del_first(&c->free_llist);
> >> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> >>
> >>         if (cnt < c->low_watermark)
> >>                  (c);
> >> +       /* Enable preemption after the enqueue of irq work completes,
> >> +        * so free_llist may be refilled by irq work before other task
> >> +        * preempts current task.
> >> +        */
> >> +       preempt_enable_notrace();
> > So this helps qp-trie init, since it's doing bpf_mem_alloc from
> > syscall context and helps bpf_obj_new from bpf prog, since prog is
> > non-migrateable, but preemptable. It's not an issue for htab doing
> > during map_update, since
> > it's under htab bucket lock.
> > Let's introduce minimal:
> > /* big comment here explaining the reason of extra preempt disable */
> > static void bpf_memalloc_irq_work_raise(...)
> > {
> >   preempt_disable_notrace();
> >   irq_work_raise();
> >   preempt_enable_notrace();
> > }
> >
> > it will have the same effect, right?
> > .
>
> No. As I said in commit message, when c->low_watermark is 1, the above
> fix doesn't work as shown below:

Yes. I got mark=1 part. I just don't think it's worth the complexity.





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux