On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 10/20/2022 2:38 AM, sdf@xxxxxxxxxx wrote:
> > On 10/19, Hou Tao wrote:
> >> From: Hou Tao <houtao1@xxxxxxxxxx>
> >
> >> A busy irq work is an unfinished irq work and it can be either in the
> >> pending state or in the running state. When destroying bpf memory
> >> allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
> >> work is invoked in a per-CPU RT-kthread. It is also possible for kernel
> >> with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host)
> >> and irq work is inovked in timer interrupt.
> >
> >> The busy refill_work leads to various issues. The obvious one is that
> >> there will be concurrent operations on free_by_rcu and free_list between
> >> irq work and memory draining. Another one is call_rcu_in_progress will
> >> not be reliable for the checking of pending RCU callback because
> >> do_call_rcu() may has not been invoked by irq work. The other is there
> >> will be use-after-free if irq work is freed before the callback of
> >> irq work is invoked as shown below:
> >
> >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> #PF: supervisor instruction fetch in kernel mode
> >> #PF: error_code(0x0010) - not-present page
> >> PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
> >> Oops: 0010 [#1] PREEMPT_RT SMP
> >> CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
> >> RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
> >> RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
> >> ......
> >> Call Trace:
> >>  <TASK>
> >>  irq_work_single+0x24/0x60
> >>  irq_work_run_list+0x24/0x30
> >>  run_irq_workd+0x23/0x30
> >>  smpboot_thread_fn+0x203/0x300
> >>  kthread+0x126/0x150
> >>  ret_from_fork+0x1f/0x30
> >>  </TASK>
> >
> >> Considering the ease of concurrency handling and the short wait time
> >> used for irq_work_sync() under PREEMPT_RT (When running two test_maps on
> >> PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and
> >> the 99th percentile is 10us), just waiting for busy refill_work to
> >> complete before memory draining and memory freeing.
> >
> >> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory
> >> allocator.")
> >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >> ---
> >>  kernel/bpf/memalloc.c | 11 +++++++++++
> >>  1 file changed, 11 insertions(+)
> >
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index 94f0f63443a6..48e606aaacf0 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> >>          rcu_in_progress = 0;
> >>          for_each_possible_cpu(cpu) {
> >>                  c = per_cpu_ptr(ma->cache, cpu);
> >> +                /*
> >> +                 * refill_work may be unfinished for PREEMPT_RT kernel
> >> +                 * in which irq work is invoked in a per-CPU RT thread.
> >> +                 * It is also possible for kernel with
> >> +                 * arch_irq_work_has_interrupt() being false and irq
> >> +                 * work is inovked in timer interrupt. So wait for the
> >> +                 * completion of irq work to ease the handling of
> >> +                 * concurrency.
> >> +                 */
> >> +                irq_work_sync(&c->refill_work);
> >
> > Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
> > We do have a bunch of them sprinkled already to run alloc/free with
> > irqs disabled.
> No. As said in the commit message and the comments, irq_work_sync() is needed
> for both PREEMPT_RT kernel and kernel with arch_irq_work_has_interrupt() being
> false.
> And for other kernels, irq_work_sync() doesn't incur any overhead,
> because it is just a simple memory read through irq_work_is_busy() and nothing
> else. The reason is the irq work must have been completed when invoking
> bpf_mem_alloc_destroy() for these kernels.
>
> void irq_work_sync(struct irq_work *work)
> {
>         /* Remove code snippet for PREEMPT_RT and arch_irq_work_has_interrupt() */
>         /* irq wor*/
>         while (irq_work_is_busy(work))
>                 cpu_relax();
> }

I see, thanks for clarifying! I was so carried away with that
PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is
a separate thing. Agreed that doing irq_work_sync won't hurt in a
non-preempt/non-has_interrupt case.

In this case, can you still do a respin and fix the spelling issue in
the comment? You can slap my acked-by for the v2:

Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>

s/work is inovked in timer interrupt. So wait for the/... invoked .../

> >
> > I was also trying to see if adding local_irq_save inside drain_mem_cache
> > to pair with the ones from refill might work, but waiting for irq to
> > finish seems easier...
> Disabling hard irq works, but irq_work_sync() is still needed to ensure it is
> completed before freeing its memory.
> >
> > Maybe also move both of these in some new "static void irq_work_wait"
> > to make it clear that the PREEMT_RT comment applies to both of them?
> >
> > Or maybe that helper should do 'for_each_possible_cpu(cpu)
> > irq_work_sync(&c->refill_work);'
> > in the PREEMPT_RT case so we don't have to call it twice?
> drain_mem_cache() is also time consuming somethings, so I think it is better to
> interleave irq_work_sync() and drain_mem_cache() to reduce waiting time.
>
> >
> >>                  drain_mem_cache(c);
> >>                  rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> >>          }
> >> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> >>                  cc = per_cpu_ptr(ma->caches, cpu);
> >>                  for (i = 0; i < NUM_CACHES; i++) {
> >>                          c = &cc->cache[i];
> >> +                        irq_work_sync(&c->refill_work);
> >>                          drain_mem_cache(c);
> >>                          rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> >>                  }
> >> --
> >> 2.29.2
> >
> > .
>
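
For anyone who wants to play with the ordering outside the kernel, here is
a minimal userspace sketch of the "wait for busy work, then drain and free"
pattern discussed in this thread. It is not kernel code and not the patch
itself: struct demo_cache, demo_refill_thread(), demo_work_sync() and
demo_destroy_after_sync() are hypothetical names, a pthread stands in for
the context that runs the irq work, and C11 atomics stand in for the busy
flag that irq_work_is_busy() reads.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-CPU cache with deferred refill work. */
struct demo_cache {
	atomic_bool work_busy;	/* true while the work is pending or running */
	int refilled;		/* written only by the deferred work */
};

/* Plays the role of the deferred callback running in another context. */
static void *demo_refill_thread(void *arg)
{
	struct demo_cache *c = arg;

	c->refilled = 1;
	/* Release: publish the write above, then report completion.
	 * This store is the thread's last access to *c.
	 */
	atomic_store_explicit(&c->work_busy, false, memory_order_release);
	return NULL;
}

/* Analogue of the sync step: spin until the work is no longer busy. */
static void demo_work_sync(struct demo_cache *c)
{
	while (atomic_load_explicit(&c->work_busy, memory_order_acquire))
		sched_yield();
}

/* Queue the work, wait for it, then "drain" and free the cache.
 * Returns the value observed during the drain, or -1 on setup failure.
 */
static int demo_destroy_after_sync(void)
{
	struct demo_cache *c = calloc(1, sizeof(*c));
	pthread_t t;
	int seen;

	if (!c)
		return -1;
	atomic_init(&c->work_busy, true);	/* work is now "pending" */
	if (pthread_create(&t, NULL, demo_refill_thread, c)) {
		free(c);
		return -1;
	}

	demo_work_sync(c);	/* without this, the free below could race */
	assert(!atomic_load(&c->work_busy));
	seen = c->refilled;	/* safe: the work has completed */
	free(c);		/* safe: the work no longer touches *c */
	pthread_join(t, NULL);	/* only reaps the thread */
	return seen;
}
```

When the work has already finished by the time demo_work_sync() is called,
the loop costs a single atomic load, which mirrors the "no overhead" point
made above for kernels where the irq work cannot still be busy.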