On Wed, Jun 21, 2023 at 08:26:30PM +0800, Hou Tao wrote:
> >  	 */
> > -	rcu_barrier_tasks_trace();
> > +	rcu_barrier(); /* wait for __free_by_rcu() */
> > +	rcu_barrier_tasks_trace(); /* wait for __free_rcu() via call_rcu_tasks_trace */
> Using rcu_barrier_tasks_trace and rcu_barrier() is not enough, the
> objects in c->free_by_rcu_ttrace may be leaked as shown below. We may
> need to add an extra variable to notify __free_by_rcu() to free these
> elements directly instead of trying to move it into
> waiting_for_gp_ttrace as I did before. Or we can just drain
> free_by_rcu_ttrace twice.
>
> destroy process                          __free_by_rcu
>
> llist_del_all(&c->free_by_rcu_ttrace)
>
>                                          // add to free_by_rcu_ttrace again
>                                          llist_add_batch(..., &tgt->free_by_rcu_ttrace)
>                                          do_call_rcu_ttrace()
>                                          // call_rcu_ttrace_in_progress is 1, so xchg return 1
>                                          // and it will not be moved to waiting_for_gp_ttrace
>                                          atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)
>
> // got 1
> atomic_read(&c->call_rcu_ttrace_in_progress)

The formatting is off, but I think I got the idea.
Yes, it's a race.

> >  		rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
> > +		rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> I got a oops in rcu_tasks_invoke_cbs() during stressing test and it
> seems we should do atomic_read(&call_rcu_in_progress) first, then do
> atomic_read(&call_rcu_ttrace_in_progress) to fix the problem. And to

Yes, it's a race. As you found out yourself, changing the order won't fix it.

> The introduction of c->tgt make the destroy procedure more complicated.
> Even with the proposed fix above, there is still oops in
> rcu_tasks_invoke_cbs() and I think it happens as follows:

Right. That's another issue.

Please send the bench htab test and your special stress test,
so we have common ground to reason about.
Also please share your bench htab numbers before/after.

I'm thinking to fix the races in the following way.
Could you please test it with your stress test?

The idea is to set 'draining' first everywhere, so that all rcu callbacks
become a nop. Then drain all link lists. At this point nothing races with
them. Then wait for a single rcu_barrier_tasks_trace() that will make sure
all callbacks are done. At this point the only thing they will do is

  if (c->draining)
          goto out;

The barriers are needed so that the access to 'c' is not a uaf.
(A simplified, standalone sketch of this draining pattern follows after
the patch below.)

...

From e20782160166d4327c76b57af160c4973396e0d0 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@xxxxxxxxxx>
Date: Wed, 21 Jun 2023 20:11:33 -0700
Subject: [PATCH bpf-next] bpf: Fix races.
Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
---
 kernel/bpf/memalloc.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 4d1002e7b4b5..75c553b15deb 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -99,6 +99,7 @@ struct bpf_mem_cache {
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
 	struct bpf_mem_cache *tgt;
+	bool draining;
 
 	/* list of objects to be freed after RCU GP */
 	struct llist_head free_by_rcu;
@@ -264,7 +265,10 @@ static void __free_rcu(struct rcu_head *head)
 {
 	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
 
+	if (unlikely(c->draining))
+		goto out;
 	free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+out:
 	atomic_set(&c->call_rcu_ttrace_in_progress, 0);
 }
 
@@ -353,8 +357,11 @@ static void __free_by_rcu(struct rcu_head *head)
 {
 	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
 	struct bpf_mem_cache *tgt = c->tgt;
-	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
+	struct llist_node *llnode;
 
+	if (unlikely(c->draining))
+		goto out;
+	llnode = llist_del_all(&c->waiting_for_gp);
 	if (!llnode)
 		goto out;
 
@@ -568,10 +575,9 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
 	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
 	 * using rcu_trace_implies_rcu_gp() as well.
 	 */
-	rcu_barrier(); /* wait for __free_by_rcu() */
-	rcu_barrier_tasks_trace(); /* wait for __free_rcu() via call_rcu_tasks_trace */
+	rcu_barrier_tasks_trace();
 	if (!rcu_trace_implies_rcu_gp())
-		rcu_barrier(); /* wait for __free_rcu() via call_rcu */
+		rcu_barrier();
 	free_mem_alloc_no_barrier(ma);
 }
 
@@ -616,6 +622,10 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 
 	if (ma->cache) {
 		rcu_in_progress = 0;
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			c->draining = true;
+		}
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(ma->cache, cpu);
 			/*
@@ -639,6 +649,13 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 	}
 	if (ma->caches) {
 		rcu_in_progress = 0;
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				c->draining = true;
+			}
+		}
 		for_each_possible_cpu(cpu) {
 			cc = per_cpu_ptr(ma->caches, cpu);
 			for (i = 0; i < NUM_CACHES; i++) {
-- 
2.34.1
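
For reference, the sketch mentioned above: a minimal, standalone illustration
of the draining pattern, not memalloc.c code. The demo_* names are made up,
and plain call_rcu()/rcu_barrier() stand in for the tasks-trace variants to
keep it short; the only point is the ordering in demo_destroy(): set the flag,
drain the list, then one barrier while 'c' is still valid.

/*
 * Hypothetical sketch, not kernel/bpf/memalloc.c. Somewhere else objects are
 * pushed onto c->free_by_rcu and the callback is queued with
 * call_rcu(&c->rcu, demo_free_cb).
 */
#include <linux/container_of.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_obj {
	struct llist_node lnode;
};

struct demo_cache {
	struct llist_head free_by_rcu;	/* objects waiting for an RCU GP */
	struct rcu_head rcu;
	bool draining;
};

static void demo_free_all(struct llist_node *llnode)
{
	struct demo_obj *obj, *tmp;

	llist_for_each_entry_safe(obj, tmp, llnode, lnode)
		kfree(obj);
}

/* RCU callback: becomes a nop once the destroy path sets c->draining */
static void demo_free_cb(struct rcu_head *head)
{
	struct demo_cache *c = container_of(head, struct demo_cache, rcu);

	if (READ_ONCE(c->draining))
		return;	/* the destroy path owns free_by_rcu now */
	demo_free_all(llist_del_all(&c->free_by_rcu));
}

static void demo_destroy(struct demo_cache *c)
{
	WRITE_ONCE(c->draining, true);			/* 1. turn pending callbacks into nops */
	demo_free_all(llist_del_all(&c->free_by_rcu));	/* 2. drain the list directly */
	rcu_barrier();					/* 3. wait for queued callbacks; 'c' must stay valid until here */
	kfree(c);					/* no callback can dereference 'c' anymore */
}

The barrier has to run before 'c' is freed, which is why the real patch keeps
rcu_barrier_tasks_trace() ahead of free_mem_alloc_no_barrier().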