Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator

Hi,

On 10/20/2022 2:38 AM, sdf@xxxxxxxxxx wrote:
> On 10/19, Hou Tao wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>
>> A busy irq work is an unfinished irq work and it can be either in the
>> pending state or in the running state. When destroying bpf memory
>> allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
>> work is invoked in a per-CPU RT-kthread. It is also possible for kernel
>> with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host)
>> and irq work is inovked in timer interrupt.
>
>> The busy refill_work leads to various issues. The obvious one is that
>> there will be concurrent operations on free_by_rcu and free_list between
>> irq work and memory draining. Another one is call_rcu_in_progress will
>> not be reliable for the checking of pending RCU callback because
>> do_call_rcu() may has not been invoked by irq work. The other is there
>> will be use-after-free if irq work is freed before the callback of
>> irq work is invoked as shown below:
>
>>   BUG: kernel NULL pointer dereference, address: 0000000000000000
>>   #PF: supervisor instruction fetch in kernel mode
>>   #PF: error_code(0x0010) - not-present page
>>   PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
>>   Oops: 0010 [#1] PREEMPT_RT SMP
>>   CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
>>   RIP: 0010:0x0
>>   Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>>   RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
>>   RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
>>   RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
>>   ......
>>   Call Trace:
>>    <TASK>
>>    irq_work_single+0x24/0x60
>>    irq_work_run_list+0x24/0x30
>>    run_irq_workd+0x23/0x30
>>    smpboot_thread_fn+0x203/0x300
>>    kthread+0x126/0x150
>>    ret_from_fork+0x1f/0x30
>>    </TASK>
>
>> Considering the ease of concurrency handling and the short wait time
>> used for irq_work_sync() under PREEMPT_RT (When running two test_maps on
>> PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and
>> the 99th percentile is 10us), just waiting for busy refill_work to
>> complete before memory draining and memory freeing.
>
>> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory
>> allocator.")
>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>> ---
>>   kernel/bpf/memalloc.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 94f0f63443a6..48e606aaacf0 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>           rcu_in_progress = 0;
>>           for_each_possible_cpu(cpu) {
>>               c = per_cpu_ptr(ma->cache, cpu);
>> +            /*
>> +             * refill_work may be unfinished for PREEMPT_RT kernel
>> +             * in which irq work is invoked in a per-CPU RT thread.
>> +             * It is also possible for kernel with
>> +             * arch_irq_work_has_interrupt() being false and irq
>> +             * work is inovked in timer interrupt. So wait for the
>> +             * completion of irq work to ease the handling of
>> +             * concurrency.
>> +             */
>> +            irq_work_sync(&c->refill_work);
>
> Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
> We do have a bunch of them sprinkled already to run alloc/free with
> irqs disabled.
No. As stated in the commit message and the comment, irq_work_sync() is needed
both for PREEMPT_RT kernels and for kernels where arch_irq_work_has_interrupt()
is false. For other kernels, irq_work_sync() doesn't incur any real overhead:
it is just a simple memory read through irq_work_is_busy() and nothing else,
because on those kernels the irq work must already have completed by the time
bpf_mem_alloc_destroy() is invoked.

void irq_work_sync(struct irq_work *work)
{
        /* Code for PREEMPT_RT and !arch_irq_work_has_interrupt() removed */
        while (irq_work_is_busy(work))
                cpu_relax();
}
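
To illustrate the "single memory read" point, here is a minimal user-space
sketch of the same polling loop. It is only a model: struct model_irq_work,
MODEL_WORK_BUSY and both helpers are hypothetical names made up for this
example, C11 atomics stand in for the kernel's atomic flags, and the read
counter exists purely to make the cost visible.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the busy bit of an irq work; model value only. */
#define MODEL_WORK_BUSY 0x2

/* Hypothetical user-space model of an irq work; not the kernel type. */
struct model_irq_work {
	atomic_int flags;
};

/* One atomic load of the flags word, mirroring the busy check. */
static bool model_irq_work_is_busy(struct model_irq_work *work)
{
	return atomic_load(&work->flags) & MODEL_WORK_BUSY;
}

/*
 * Poll until the work is no longer busy and return how many times the
 * flags word was read. For a work that has already completed, the loop
 * condition is evaluated once and the function returns 1.
 */
static unsigned long model_irq_work_sync(struct model_irq_work *work)
{
	unsigned long reads = 1;

	while (model_irq_work_is_busy(work))
		reads++;
	return reads;
}
```

With the flags word cleared, model_irq_work_sync() returns after exactly one
read, which is the situation described above for kernels where the irq work
has already run before bpf_mem_alloc_destroy() is called.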

>
> I was also trying to see if adding local_irq_save inside drain_mem_cache
> to pair with the ones from refill might work, but waiting for irq to
> finish seems easier...
Disabling hard irqs would work for the draining itself, but irq_work_sync() is
still needed to ensure the irq work has completed before its memory is freed.
>
> Maybe also move both of these in some new "static void irq_work_wait"
> to make it clear that the PREEMT_RT comment applies to both of them?
>
> Or maybe that helper should do 'for_each_possible_cpu(cpu)
> irq_work_sync(&c->refill_work);'
> in the PREEMPT_RT case so we don't have to call it twice?
drain_mem_cache() is also time-consuming sometimes, so I think it is better to
interleave irq_work_sync() and drain_mem_cache() to reduce the waiting time.
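
A toy model of that argument, as a rough sketch only: times are abstract
ticks rather than measurements, struct model_cache and both functions are
hypothetical names made up for this example, and each cache's refill work is
assumed to finish on its own CPU regardless of what the destroying thread is
doing.

```c
#include <stddef.h>

/* Hypothetical model of one per-CPU cache; ticks are abstract units. */
struct model_cache {
	unsigned long busy_until;	/* tick at which its refill work completes */
	unsigned long drain_cost;	/* ticks needed to drain this cache */
};

/* Sync and drain each cache in turn; returns the ticks spent waiting. */
static unsigned long wait_interleaved(const struct model_cache *c, size_t n)
{
	unsigned long now = 0, waited = 0;

	for (size_t i = 0; i < n; i++) {
		if (c[i].busy_until > now) {
			waited += c[i].busy_until - now;
			now = c[i].busy_until;
		}
		/* Draining takes time during which later works can finish. */
		now += c[i].drain_cost;
	}
	return waited;
}

/* Sync every cache first and drain afterwards; returns the ticks spent waiting. */
static unsigned long wait_sync_first(const struct model_cache *c, size_t n)
{
	unsigned long now = 0, waited = 0;

	for (size_t i = 0; i < n; i++) {
		if (c[i].busy_until > now) {
			waited += c[i].busy_until - now;
			now = c[i].busy_until;
		}
	}
	return waited;
}
```

For three caches with (busy_until, drain_cost) of (0, 4), (5, 2) and (3, 1),
the interleaved loop waits 1 tick in total while the sync-first loop waits 5,
because in the interleaved case the time spent draining the first cache
overlaps with the refill work still running on the other CPUs.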

>
>>               drain_mem_cache(c);
>>               rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>           }
>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>               cc = per_cpu_ptr(ma->caches, cpu);
>>               for (i = 0; i < NUM_CACHES; i++) {
>>                   c = &cc->cache[i];
>> +                irq_work_sync(&c->refill_work);
>>                   drain_mem_cache(c);
>>                   rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>               }
>> -- 
>> 2.29.2
>
> .



