On 6/26/23 12:16 AM, Hou Tao wrote:
Hi,
On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
Hi,
On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
From: Alexei Starovoitov <ast@xxxxxxxxxx>
alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
---
kernel/bpf/memalloc.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 692a9a30c1dc..666917c16e87 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
After allowing to reuse elements from waiting_for_gp_ttrace, there may
be concurrent llist_del_first() and llist_del_all() calls as shown below,
and llist_del_first() is not safe here because the elements freed through
free_rcu() could be reused immediately, so head->first may be added back
to c0->waiting_for_gp_ttrace by another process.
// c0
alloc_bulk()
    llist_del_first(&c->waiting_for_gp_ttrace)

// c1->tgt = c0
free_rcu()
    llist_del_all(&c->waiting_for_gp_ttrace)
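For reference, the two llist operations involved look roughly like this
(a simplified sketch modelled on include/linux/llist.h, not the exact
kernel source):

/*
 * llist_del_all() is a single xchg(), so it is immune to concurrent
 * add/del. llist_del_first() samples first/next and then cmpxchg()es
 * the head, which leaves a window in which the sampled node can be
 * removed, freed and re-added by other CPUs.
 */
struct llist_node *llist_del_all(struct llist_head *head)
{
	return xchg(&head->first, NULL);
}

struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (entry == NULL)
			return NULL;
		next = READ_ONCE(entry->next);
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}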
I'm still thinking about how to fix the other issues you've reported,
but this one, I believe, is fine.
Are you basing 'not safe' on a comment?
Why would xchg(&head->first, NULL); on one cpu and
try_cmpxchg(&head->first, &entry, next);
on another be unsafe?
No, I didn't reason it based only on the comment in llist.h. The reason
I thought it was not safe is that llist_del_first() may hit the ABA
problem you pointed out earlier: after __free_rcu(), the elements on
waiting_for_gp_ttrace become available immediately, so
waiting_for_gp_ttrace->first may be reused and then added back to
waiting_for_gp_ttrace again, as shown below.
// c0->waiting_for_gp_ttrace
A -> B -> C -> nil

P1:
alloc_bulk(c0)
    llist_del_first(&c0->waiting_for_gp_ttrace)
        entry = A
        next = B

P2: __free_rcu
    // A/B/C is freed back to slab
    // c0->waiting_for_gp_ttrace->first = NULL
    free_all(llist_del_all(c0))

P3: alloc_bulk(c1)
    // got A
    __kmalloc()

// free A (from c1), ..., last free X (allocated from c0)
P3: unit_free(c1)
    // the last freed element X is from c0
    c1->tgt = c0
    c1->free_llist->first -> X -> Y -> ... -> A

P3: free_bulk(c1)
    enque_to_free(c0)
        c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
    __llist_add_batch(c0->waiting_for_gp_ttrace)
        c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
In theory that's possible, but for this to happen one cpu needs to be a
thousand times slower than all the others, and since there is no
preemption inside llist_del_first() I don't think we need to worry about it.
Also, with the removal of the _tail optimization the above
llist_add_batch(waiting_for_gp_ttrace)
will become a loop, so the reused element will end up at the very end
instead of at the top; then one cpu would have to be a million times
slower, which is not realistic.
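For illustration, the difference would be roughly the following (a
hypothetical sketch; the real call site is do_call_rcu_ttrace() in
kernel/bpf/memalloc.c, and llhead/lltail just stand for the batch
head/tail pointers, so the exact code may differ):

/* with the _tail optimization: one batch-add, the old head (A) stays
 * at the front of waiting_for_gp_ttrace
 */
__llist_add_batch(llhead, lltail, &c->waiting_for_gp_ttrace);

/* without it: elements are pushed one by one, so the old head (A) is
 * added first and ends up at the tail of waiting_for_gp_ttrace
 */
llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
	llist_add(llnode, &c->waiting_for_gp_ttrace);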
P1:
    // A is added back as first again,
    // but llist_del_first() didn't notice
    try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
    // c0->waiting_for_gp_ttrace is corrupted
    c0->waiting_for_gp_ttrace->first = B
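To make the interleaving concrete, here is a self-contained user-space
sketch of the same ABA sequence (C11 atomics stand in for the kernel's
xchg/try_cmpxchg; the nodes A/B/C/X/Y and the steps mirror the scenario
above, none of this is kernel code):

#include <stdatomic.h>
#include <stdio.h>

struct node { struct node *next; };
struct head { _Atomic(struct node *) first; };

int main(void)
{
	struct node A, B, C, X, Y;
	struct head gp;		/* stands in for c0->waiting_for_gp_ttrace */

	/* initial list: A -> B -> C -> nil */
	C.next = NULL; B.next = &C; A.next = &B;
	atomic_store(&gp.first, &A);

	/* P1: llist_del_first() samples entry = A, next = B, then stalls
	 * right before its try_cmpxchg()
	 */
	struct node *entry = atomic_load(&gp.first);
	struct node *next = entry->next;

	/* P2: free_all(llist_del_all(c0)) empties the list and A/B/C go
	 * back to slab for reuse
	 */
	atomic_store(&gp.first, NULL);

	/* P3: A is reallocated, freed into c1 with c1->tgt = c0, and
	 * re-added at the head of c0->waiting_for_gp_ttrace as
	 * A -> ... -> Y -> X
	 */
	X.next = NULL; Y.next = &X; A.next = &Y;
	atomic_store(&gp.first, &A);

	/* P1 resumes: first is still A, so the CAS succeeds and installs
	 * the stale B, silently dropping Y and X
	 */
	int ok = atomic_compare_exchange_strong(&gp.first, &entry, next);

	printf("cmpxchg ok=%d, first==B: %d (list corrupted)\n",
	       ok, atomic_load(&gp.first) == &B);
	return 0;
}

The CAS only compares the pointer value, so the re-added A is
indistinguishable from the original one; that is the classic ABA pattern,
and presumably what the llist.h comment about needing a lock when mixing
llist_del_first() with other del_first/del_all consumers is about.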