On Thu, Aug 24, 2023 at 7:07 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/24/2023 12:33 AM, Alexei Starovoitov wrote:
> > On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote:
> >>> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
> >>>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>>>
> >>>>>> When doing stress tests for qp-trie, bpf_mem_alloc() returned NULL
> >>>>>> unexpectedly because all qp-trie operations were initiated from
> >>>>>> bpf syscalls and there was still available free memory. bpf_obj_new()
> >>>>>> has the same problem, as shown by the following selftest.
> >>>>>>
> >>>>>> The failure is due to preemption. irq_work_raise() will invoke
> >>>>>> irq_work_claim() first to mark the irq work as pending and then invoke
> >>>>>> __irq_work_queue_local() to raise an IPI. So when the current task
> >>>>>> which is invoking irq_work_raise() is preempted by another task,
> >>>>>> unit_alloc() may return NULL for the preempting task as shown below:
> >>>>>>
> >>>>>> task A         task B
> >>>>>>
> >>>>>> unit_alloc()
> >>>>>>   // low_watermark = 32
> >>>>>>   // free_cnt = 31 after alloc
> >>>>>>   irq_work_raise()
> >>>>>>     // mark irq work as IRQ_WORK_PENDING
> >>>>>>     irq_work_claim()
> >>>>>>
> >>>>>>                // task B preempts task A
> >>>>>>                unit_alloc()
> >>>>>>                  // free_cnt = 30 after alloc
> >>>>>>                  // irq work is already PENDING,
> >>>>>>                  // so just return
> >>>>>>                  irq_work_raise()
> >>>>>>                // does unit_alloc() 30-times
> >>>>>>                ......
> >>>>>>                unit_alloc()
> >>>>>>                  // free_cnt = 0 before alloc
> >>>>>>                  return NULL
> >>>>>>
> >>>>>> Fix it by invoking preempt_disable_notrace() before allocation and
> >>>>>> invoking preempt_enable_notrace() to enable preemption after
> >>>>>> irq_work_raise() completes. An alternative fix is to move
> >>>>>> local_irq_restore() after the invocation of irq_work_raise(), but it
> >>>>>> will enlarge the irq-disabled region. Another feasible fix is to only
> >>>>>> disable preemption before invoking irq_work_queue() and enable
> >>>>>> preemption after the invocation in irq_work_raise(), but it can't
> >>>>>> handle the case when c->low_watermark is 1.
> >>>>>>
> >>>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>>> ---
> >>>>>>  kernel/bpf/memalloc.c | 8 ++++++++
> >>>>>>  1 file changed, 8 insertions(+)
> >>>>>>
> >>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >>>>>> index 9c49ae53deaf..83f8913ebb0a 100644
> >>>>>> --- a/kernel/bpf/memalloc.c
> >>>>>> +++ b/kernel/bpf/memalloc.c
> >>>>>> @@ -6,6 +6,7 @@
> >>>>>>  #include <linux/irq_work.h>
> >>>>>>  #include <linux/bpf_mem_alloc.h>
> >>>>>>  #include <linux/memcontrol.h>
> >>>>>> +#include <linux/preempt.h>
> >>>>>>  #include <asm/local.h>
> >>>>>>
> >>>>>>  /* Any context (including NMI) BPF specific memory allocator.
> >>>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> >>>>>>          * Use per-cpu 'active' counter to order free_list access between
> >>>>>>          * unit_alloc/unit_free/bpf_mem_refill.
> >>>>>>          */
> >>>>>> +       preempt_disable_notrace();
> >>>>>>         local_irq_save(flags);
> >>>>>>         if (local_inc_return(&c->active) == 1) {
> >>>>>>                 llnode = __llist_del_first(&c->free_llist);
> >>>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> >>>>>>
> >>>>>>         if (cnt < c->low_watermark)
> >>>>>>                 irq_work_raise(c);
> >>>>>> +       /* Enable preemption after the enqueue of irq work completes,
> >>>>>> +        * so free_llist may be refilled by irq work before other task
> >>>>>> +        * preempts current task.
> >>>>>> +        */
> >>>>>> +       preempt_enable_notrace();
> >>>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from
> >>>>> syscall context, and helps bpf_obj_new from bpf prog, since prog is
> >>>>> non-migratable but preemptable. It's not an issue for htab doing it
> >>>>> during map_update, since
> >>>>> it's under the htab bucket lock.
> >>>>> Let's introduce minimal:
> >>>>> /* big comment here explaining the reason of extra preempt disable */
> >>>>> static void bpf_memalloc_irq_work_raise(...)
> >>>>> {
> >>>>>    preempt_disable_notrace();
> >>>>>    irq_work_raise();
> >>>>>    preempt_enable_notrace();
> >>>>> }
> >>>>>
> >>>>> It will have the same effect, right?
> >>>>> .
> >>>> No. As I said in the commit message, when c->low_watermark is 1, the above
> >>>> fix doesn't work as shown below:
> >>> Yes. I got the mark=1 part. I just don't think it's worth the complexity.
> >> Just found out that for bpf_obj_new() the minimal low_watermark is 2
> >> instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
> >> low_watermark as 2, the above fix may not work when there is nested
> >> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
> >> (free_cnt = 0 after alloc) -> preempted by task C (fails to do the
> >> allocation). And in my naive understanding of the bpf memory allocator,
> >> these fixes are simple. Why do you think it will introduce extra complexity?
> >> Do you mean preempt_disable_notrace() could be used to trigger the
> >> running of a bpf program? If that is the problem, I think we should fix it
> >> instead.
> > I'm not worried about recursive calls from _notrace(). That shouldn't
> > be possible.
>
> OK
> > I'm just saying that disabling preemption around irq_work_raise() helps a bit,
> > while disabling it around the whole unit_alloc/free is snake oil.
> > A bpf prog could be running in an irq-disabled and preempt-disabled context;
> > there, unit_alloc vs irq_work_raise won't make any difference. Both will return NULL.
> > Same with batched htab update. It will hit NULL too.
> > So from my pov you're trying to fix something that is not fixable.
> The patch set didn't try to fix the problem for all possible contexts,
> especially the irq-disabled context. It just tries to fix the ENOMEM
> problem for process context, which is the major context. I still think
> disabling preemption around the whole unit_alloc/free is much more solid
> than just doing that for irq_work_raise() (e.g., for the nested preemption
> case). But if you have a strong preference for only disabling preemption
> for irq_work_raise(), I will post v2 to do that.

In process ctx the preempt_disable/enable across unit_alloc will keep asking
the kernel to consider preemption on every unit_alloc call, which can be a lot.
If I'm reading the code correctly, preempt_schedule() is quite heavy.
Doing it on every unit_alloc is a performance concern.

Could you try the following:

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9c49ae53deaf..ee8262f58c5a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work)

 static void notrace irq_work_raise(struct bpf_mem_cache *c)
 {
-       irq_work_queue(&c->refill_work);
+       if (!irq_work_queue(&c->refill_work)) {
+               preempt_disable_notrace();
+               preempt_enable_notrace();
+       }
 }

The idea is that it will ask for a resched if preemptible.
Will it address the issue you're seeing?
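
For reference, a minimal annotated sketch of the proposal above (a sketch of the idea
being discussed, not a committed fix). The assumptions spelled out in the comments are:
irq_work_queue() returns false when the work is already claimed as IRQ_WORK_PENDING,
and preempt_enable_notrace() re-checks need_resched on its way out, so the empty
disable/enable pair only acts as a preemption point on the already-pending path:

/* Sketch of the proposed irq_work_raise() in kernel/bpf/memalloc.c. */
static void notrace irq_work_raise(struct bpf_mem_cache *c)
{
	/* irq_work_queue() returns false when refill_work is already
	 * IRQ_WORK_PENDING, i.e. another task claimed it earlier and may
	 * have been preempted before raising the IPI that runs
	 * bpf_mem_refill().
	 */
	if (!irq_work_queue(&c->refill_work)) {
		/* Empty pair: preempt_enable_notrace() re-checks
		 * need_resched, so if a reschedule is pending the current
		 * task yields here, giving the task that claimed the irq
		 * work a chance to raise the IPI and refill c->free_llist
		 * before free_cnt is drained to zero.
		 */
		preempt_disable_notrace();
		preempt_enable_notrace();
	}
}

Compared with wrapping the whole unit_alloc() in preempt_disable/enable, this only
touches the slow path where the irq work was already pending, so the common allocation
path does not pay for an extra preemption check.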