Hi,

On 8/23/2023 9:57 AM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>
>>>> When doing a stress test for qp-trie, bpf_mem_alloc() returned NULL
>>>> unexpectedly because all qp-trie operations were initiated from
>>>> bpf syscalls and there was still free memory available. bpf_obj_new()
>>>> has the same problem as shown by the following selftest.
>>>>
>>>> The failure is due to preemption. irq_work_raise() will invoke
>>>> irq_work_claim() first to mark the irq work as pending and then invoke
>>>> __irq_work_queue_local() to raise an IPI. So when the current task
>>>> which is invoking irq_work_raise() is preempted by another task,
>>>> unit_alloc() may return NULL for the preempting task as shown below:
>>>>
>>>> task A                              task B
>>>>
>>>> unit_alloc()
>>>>   // low_watermark = 32
>>>>   // free_cnt = 31 after alloc
>>>>   irq_work_raise()
>>>>     // mark irq work as IRQ_WORK_PENDING
>>>>     irq_work_claim()
>>>>
>>>>                                     // task B preempts task A
>>>>                                     unit_alloc()
>>>>                                       // free_cnt = 30 after alloc
>>>>                                       // irq work is already PENDING,
>>>>                                       // so just return
>>>>                                       irq_work_raise()
>>>>                                     // does unit_alloc() 30 times
>>>>                                     ......
>>>>                                     unit_alloc()
>>>>                                       // free_cnt = 0 before alloc
>>>>                                       return NULL
>>>>
>>>> Fix it by invoking preempt_disable_notrace() before allocation and
>>>> invoking preempt_enable_notrace() to enable preemption after
>>>> irq_work_raise() completes. An alternative fix is to move
>>>> local_irq_restore() after the invocation of irq_work_raise(), but it
>>>> will enlarge the irq-disabled region. Another feasible fix is to only
>>>> disable preemption before invoking irq_work_queue() and enable
>>>> preemption after the invocation in irq_work_raise(), but it can't
>>>> handle the case when c->low_watermark is 1.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>> ---
>>>>  kernel/bpf/memalloc.c | 8 ++++++++
>>>>  1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 9c49ae53deaf..83f8913ebb0a 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -6,6 +6,7 @@
>>>>  #include <linux/irq_work.h>
>>>>  #include <linux/bpf_mem_alloc.h>
>>>>  #include <linux/memcontrol.h>
>>>> +#include <linux/preempt.h>
>>>>  #include <asm/local.h>
>>>>
>>>>  /* Any context (including NMI) BPF specific memory allocator.
>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>           * Use per-cpu 'active' counter to order free_list access between
>>>>           * unit_alloc/unit_free/bpf_mem_refill.
>>>>           */
>>>> +        preempt_disable_notrace();
>>>>          local_irq_save(flags);
>>>>          if (local_inc_return(&c->active) == 1) {
>>>>                  llnode = __llist_del_first(&c->free_llist);
>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>
>>>>          if (cnt < c->low_watermark)
>>>>                  irq_work_raise(c);
>>>> +        /* Enable preemption after the enqueue of irq work completes,
>>>> +         * so free_llist may be refilled by irq work before other task
>>>> +         * preempts current task.
>>>> +         */
>>>> +        preempt_enable_notrace();
>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from
>>> syscall context, and helps bpf_obj_new from a bpf prog, since the prog
>>> is non-migratable but preemptable. It's not an issue for htab during
>>> map_update, since it's under the htab bucket lock.
>>> Let's introduce minimal:
>>>
>>> /* big comment here explaining the reason of extra preempt disable */
>>> static void bpf_memalloc_irq_work_raise(...)
>>> {
>>>         preempt_disable_notrace();
>>>         irq_work_raise();
>>>         preempt_enable_notrace();
>>> }
>>>
>>> it will have the same effect, right?
>>
>> No. As I said in the commit message, when c->low_watermark is 1, the
>> above fix doesn't work as shown below:
> Yes. I got the mark=1 part. I just don't think it's worth the complexity.

Just found out that for bpf_obj_new() the minimal low_watermark is 2
instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
low_watermark as 2, the above fix may not work when there is nested
preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
(free_cnt = 0 after alloc) -> preempted by task C (fails to allocate).
A rough sketch of this case follows at the end of this mail. And in my
naive understanding of the bpf memory allocator, these fixes are simple.
Why do you think it will introduce extra complexity? Do you mean
preempt_disable_notrace() could be used to trigger the running of a bpf
program? If that is the problem, I think we should fix it instead.
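
Here is the rough sketch of the nested preemption case, in the same style
as the diagram in the commit message. It assumes low_watermark = 2 and that
both task A and task B are preempted in the window between the allocation
and the call to the proposed bpf_memalloc_irq_work_raise() wrapper, so the
irq work is never raised before the free list is drained:

task A                        task B                        task C

unit_alloc()
  // low_watermark = 2
  // free_cnt = 1 after alloc
  // preempted before reaching
  // the irq_work_raise() wrapper
                              unit_alloc()
                                // free_cnt = 0 after alloc
                                // also preempted before
                                // reaching the wrapper
                                                            unit_alloc()
                                                              // free_cnt = 0 before alloc
                                                              return NULL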