On 10/19, Hou Tao wrote:
From: Hou Tao <houtao1@xxxxxxxxxx>
A busy irq work is an unfinished irq work and it can be either in the
pending state or in the running state. When destroying the bpf memory
allocator, refill_work may be busy on a PREEMPT_RT kernel, in which
irq work is invoked in a per-CPU RT kthread. It is also possible on a
kernel where arch_irq_work_has_interrupt() is false (e.g. a 1-cpu
arm32 host) and irq work is invoked in the timer interrupt.
The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_list
between the irq work and memory draining. Another one is that
call_rcu_in_progress will not be a reliable indicator of a pending RCU
callback because do_call_rcu() may not have been invoked by the irq
work yet. The last one is a use-after-free if the irq work is freed
before its callback is invoked, as shown below:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
Oops: 0010 [#1] PREEMPT_RT SMP
CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
......
Call Trace:
<TASK>
irq_work_single+0x24/0x60
irq_work_run_list+0x24/0x30
run_irq_workd+0x23/0x30
smpboot_thread_fn+0x203/0x300
kthread+0x126/0x150
ret_from_fork+0x1f/0x30
</TASK>
Considering the ease of concurrency handling and the short wait time
for irq_work_sync() under PREEMPT_RT (when running two test_maps on a
PREEMPT_RT kernel and a 72-cpu host, the max wait time is about 8ms
and the 99th percentile is 10us), just wait for the busy refill_work
to complete before memory draining and memory freeing.
Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
---
kernel/bpf/memalloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 94f0f63443a6..48e606aaacf0 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
rcu_in_progress = 0;
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(ma->cache, cpu);
+ /*
+ * refill_work may be unfinished for PREEMPT_RT kernel
+ * in which irq work is invoked in a per-CPU RT thread.
+ * It is also possible for a kernel with
+ * arch_irq_work_has_interrupt() being false, where irq
+ * work is invoked in the timer interrupt. So wait for the
+ * completion of irq work to ease the handling of
+ * concurrency.
+ */
+ irq_work_sync(&c->refill_work);
Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)"?
We do have a bunch of them sprinkled already to run alloc/free with
irqs disabled.
I was also trying to see if adding local_irq_save inside drain_mem_cache
to pair with the ones from refill might work, but waiting for irq to
finish seems easier...
Maybe also move both of these into some new "static void irq_work_wait"
to make it clear that the PREEMPT_RT comment applies to both of them?
Or maybe that helper should do 'for_each_possible_cpu(cpu)
irq_work_sync(&c->refill_work);'
in the PREEMPT_RT case so we don't have to call it twice?
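Something like the below completely untested sketch is what I had in
mind. The helper name "irq_work_wait" and the IS_ENABLED guard are only
for illustration; the guard may also need to cover the
!arch_irq_work_has_interrupt() case from your commit message:

static void irq_work_wait(struct bpf_mem_cache *c)
{
	/* On PREEMPT_RT the irq work runs in a per-CPU RT kthread, so
	 * refill_work may still be pending or running here. Wait for
	 * it to finish before draining to avoid racing with the refill.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		irq_work_sync(&c->refill_work);
}

and then both loops in bpf_mem_alloc_destroy() would just call
irq_work_wait(c) right before drain_mem_cache(c).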
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
}
@@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
cc = per_cpu_ptr(ma->caches, cpu);
for (i = 0; i < NUM_CACHES; i++) {
c = &cc->cache[i];
+ irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
}
--
2.29.2