Re: [PATCH bpf-next 1/3] bpf: Enable preemption after irq_work_raise() in unit_alloc()

Hi,

On 8/24/2023 12:33 AM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote:
>>> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote:
>>>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>>
>>>>>> When doing stress tests for qp-trie, bpf_mem_alloc() returned NULL
>>>>>> unexpectedly because all qp-trie operations were initiated from
>>>>>> bpf syscalls and there was still free memory available. bpf_obj_new()
>>>>>> has the same problem, as shown by the following selftest.
>>>>>>
>>>>>> The failure is due to preemption. irq_work_raise() invokes
>>>>>> irq_work_claim() first to mark the irq work as pending and then invokes
>>>>>> __irq_work_queue_local() to raise an IPI. So when the current task,
>>>>>> which is invoking irq_work_raise(), is preempted by another task,
>>>>>> unit_alloc() may return NULL for the preempting task, as shown below:
>>>>>>
>>>>>> task A         task B
>>>>>>
>>>>>> unit_alloc()
>>>>>>   // low_watermark = 32
>>>>>>   // free_cnt = 31 after alloc
>>>>>>   irq_work_raise()
>>>>>>     // mark irq work as IRQ_WORK_PENDING
>>>>>>     irq_work_claim()
>>>>>>
>>>>>>                // task B preempts task A
>>>>>>                unit_alloc()
>>>>>>                  // free_cnt = 30 after alloc
>>>>>>                  // irq work is already PENDING,
>>>>>>                  // so just return
>>>>>>                  irq_work_raise()
>>>>>>                // does unit_alloc() 30 times
>>>>>>                ......
>>>>>>                unit_alloc()
>>>>>>                  // free_cnt = 0 before alloc
>>>>>>                  return NULL
>>>>>>
>>>>>> Fix it by invoking preempt_disable_notrace() before the allocation and
>>>>>> preempt_enable_notrace() after irq_work_raise() completes. An
>>>>>> alternative fix is to move local_irq_restore() after the invocation of
>>>>>> irq_work_raise(), but that would enlarge the irq-disabled region.
>>>>>> Another feasible fix is to disable preemption only around the
>>>>>> irq_work_queue() call in irq_work_raise(), but that can't handle the
>>>>>> case when c->low_watermark is 1.
>>>>>>
>>>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>> ---
>>>>>>  kernel/bpf/memalloc.c | 8 ++++++++
>>>>>>  1 file changed, 8 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>>>> index 9c49ae53deaf..83f8913ebb0a 100644
>>>>>> --- a/kernel/bpf/memalloc.c
>>>>>> +++ b/kernel/bpf/memalloc.c
>>>>>> @@ -6,6 +6,7 @@
>>>>>>  #include <linux/irq_work.h>
>>>>>>  #include <linux/bpf_mem_alloc.h>
>>>>>>  #include <linux/memcontrol.h>
>>>>>> +#include <linux/preempt.h>
>>>>>>  #include <asm/local.h>
>>>>>>
>>>>>>  /* Any context (including NMI) BPF specific memory allocator.
>>>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>>>          * Use per-cpu 'active' counter to order free_list access between
>>>>>>          * unit_alloc/unit_free/bpf_mem_refill.
>>>>>>          */
>>>>>> +       preempt_disable_notrace();
>>>>>>         local_irq_save(flags);
>>>>>>         if (local_inc_return(&c->active) == 1) {
>>>>>>                 llnode = __llist_del_first(&c->free_llist);
>>>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>>>>
>>>>>>         if (cnt < c->low_watermark)
>>>>>>                 irq_work_raise(c);
>>>>>> +       /* Enable preemption after the enqueue of irq work completes,
>>>>>> +        * so free_llist may be refilled by irq work before other task
>>>>>> +        * preempts current task.
>>>>>> +        */
>>>>>> +       preempt_enable_notrace();
>>>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from
>>>>> syscall context, and helps bpf_obj_new from a bpf prog, since the prog
>>>>> is non-migratable but preemptible. It's not an issue for htab during
>>>>> map_update, since it's under the htab bucket lock.
>>>>> Let's introduce minimal:
>>>>> /* big comment here explaining the reason of extra preempt disable */
>>>>> static void bpf_memalloc_irq_work_raise(...)
>>>>> {
>>>>>   preempt_disable_notrace();
>>>>>   irq_work_raise();
>>>>>   preempt_enable_notrace();
>>>>> }
>>>>>
>>>>> it will have the same effect, right?
>>>>> .
>>>> No. As I said in the commit message, when c->low_watermark is 1, the
>>>> above fix doesn't work, as shown below:
>>> Yes. I got the mark=1 part. I just don't think it's worth the complexity.
>> I just found out that for bpf_obj_new() the minimal low_watermark is 2
>> instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
>> low_watermark as 2, the above fix may not work when there is nested
>> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
>> (free_cnt = 0 after alloc) -> preempted by task C (allocation fails).
>> And in my naive understanding of the bpf memory allocator, these fixes
>> are simple. Why do you think they will introduce extra complexity?
>> Do you mean preempt_disable_notrace() could be used to trigger the
>> running of a bpf program? If that is the problem, I think we should
>> fix it instead.
> I'm not worried about recursive calls from _notrace(). That shouldn't
> be possible.

OK
> I'm just saying that disabling preemption around irq_work_raise() helps a bit,
> while disabling it around the whole unit_alloc/free is snake oil.
> A bpf prog could be running in an irq-disabled context, and then
> preempt-disabled unit_alloc vs irq_work_raise won't make any difference.
> Both will return NULL. Same with a batched htab update; it will hit NULL too.
> So from my pov you're trying to fix something that is not fixable.

The patch set doesn't try to fix the problem for all possible contexts,
especially the irq-disabled context. It only tries to fix the ENOMEM
problem for process context, which is the major context. I still think
disabling preemption around the whole unit_alloc/free is more solid than
doing it only for irq_work_raise() (e.g., for the nested preemption
case). But if you have a strong preference for only disabling preemption
around irq_work_raise(), I will post v2 to do that.
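
For reference, a minimal sketch of that wrapper-only variant, as it could
look inside kernel/bpf/memalloc.c (the wrapper name below is illustrative,
not taken from the posted patches), would be:

/* Disable preemption only around the enqueue of the irq work, so the
 * raising task can not be preempted between irq_work_claim() and
 * irq_work_queue(). As discussed above, this still leaves a window when
 * low_watermark is small and preemption nests.
 */
static void bpf_mem_alloc_irq_work_raise(struct bpf_mem_cache *c)
{
	preempt_disable_notrace();
	irq_work_raise(c);
	preempt_enable_notrace();
}

unit_alloc()/unit_free() would then call bpf_mem_alloc_irq_work_raise(c)
where they currently call irq_work_raise(c).
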
> Batched alloc will fail. The users have to use bpf_ma differently.

At least on x86-64 and in process context, I don't think batched alloc
will fail, as shown by the selftest which tries to allocate 512
4096-byte objects in a bpf_loop. It also didn't fail when I changed
bpf_mem_free() to bpf_mem_free_rcu() in bpf_obj_drop() to disable
immediate reuse.
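
For completeness, here is a rough sketch of that kind of batched-allocation
test. It is not the exact selftest from this series; the struct, section and
variable names are illustrative, and it assumes the bpf_obj_new()/
bpf_obj_drop() macros from the selftests' bpf_experimental.h:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

struct big_obj {
	/* 4088 bytes of payload; with the 8-byte llist node added by
	 * bpf_mem_alloc() the unit_size becomes 4096, so low_watermark
	 * is 2 as mentioned above.
	 */
	char pad[4088];
};

int alloc_fail;

static int alloc_one(__u32 index, void *ctx)
{
	struct big_obj *obj;

	obj = bpf_obj_new(struct big_obj);
	if (!obj) {
		alloc_fail++;
		return 1;	/* stop the loop on the first failure */
	}
	bpf_obj_drop(obj);
	return 0;
}

SEC("tc")
int batch_alloc(void *ctx)
{
	/* hit unit_alloc() 512 times from preemptible process context */
	bpf_loop(512, alloc_one, NULL, 0);
	return 0;
}

char _license[] SEC("license") = "GPL";

Userspace runs the program via BPF_PROG_TEST_RUN and checks that alloc_fail
stays 0.
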
> In places where they cannot afford alloc failures they need to pre-allocate.
> We were planning to introduce a bpf_obj_new() that does kmalloc when
> the bpf prog is sleepable. Then the bpf prog can stash such a manually
> pre-allocated object and use it later.
I also thought about this problem when writing the patch set. Maybe we
could introduce bpf_mem_preload() or something similar to do that. For
non-sleepable bpf programs, maybe we need to introduce some userspace
APIs to do pre-allocation.
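
As a rough illustration of the stash-and-reuse idea with today's APIs (the
allocation below still goes through bpf_mem_alloc() rather than kmalloc, and
all struct, map and section names are just examples), a program could park a
pre-allocated object in a map value kptr field and take it over later with
bpf_kptr_xchg():

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

struct node_data {
	long payload[32];
};

struct map_value {
	struct node_data __kptr *stashed;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct map_value);
} stash SEC(".maps");

SEC("tc")
int prefill(void *ctx)
{
	struct node_data *new, *old;
	struct map_value *v;
	int key = 0;

	v = bpf_map_lookup_elem(&stash, &key);
	if (!v)
		return 0;
	new = bpf_obj_new(struct node_data);
	if (!new)
		return 0;
	/* park the new object; drop any previously stashed one */
	old = bpf_kptr_xchg(&v->stashed, new);
	if (old)
		bpf_obj_drop(old);
	return 0;
}

char _license[] SEC("license") = "GPL";

A later program (or a later run of the same program) can then take the object
back with bpf_kptr_xchg(&v->stashed, NULL) instead of allocating on the spot.
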




