On 6/26/23 12:16 AM, Hou Tao wrote:
Hi,
On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
Hi,
On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
From: Alexei Starovoitov <ast@xxxxxxxxxx>
alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
---
kernel/bpf/memalloc.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 692a9a30c1dc..666917c16e87 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
After allowing to reuse elements from waiting_for_gp_ttrace, there may
be concurrent llist_del_first() and llist_del_all() calls as shown below,
and llist_del_first() is not safe here because the elements freed through
free_rcu() could be reused immediately, so head->first may be added back
to c0->waiting_for_gp_ttrace by another process.
// c0
alloc_bulk()
    llist_del_first(&c->waiting_for_gp_ttrace)

// c1->tgt = c0
free_rcu()
    llist_del_all(&c->waiting_for_gp_ttrace)
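For reference, the two llist operations involved look roughly like this
(a simplified sketch modelled on include/linux/llist.h, not the exact
kernel source):

/*
 * llist_del_all() is a single xchg(), so it is immune to concurrent
 * add/del. llist_del_first() samples first/next and then cmpxchg()es
 * the head, which leaves a window in which the sampled node can be
 * removed, freed and re-added by other CPUs.
 */
struct llist_node *llist_del_all(struct llist_head *head)
{
	return xchg(&head->first, NULL);
}

struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (entry == NULL)
			return NULL;
		next = READ_ONCE(entry->next);
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}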
I'm still thinking about how to fix the other issues you've reported,
but this one, I believe, is fine.
Are you basing 'not safe' on a comment?
Why would xchg(&head->first, NULL); on one cpu and
try_cmpxchg(&head->first, &entry, next);
on another be unsafe?
No, I didn't reason it based only on the comment in llist.h. The reason
I thought it was not safe is that llist_del_first() may hit the ABA
problem you pointed out earlier: after __free_rcu(), the elements on
waiting_for_gp_ttrace become available immediately, so
waiting_for_gp_ttrace->first may be reused and then added back to
waiting_for_gp_ttrace again, as shown below.
// c0->waiting_for_gp_ttrace
A -> B -> C -> nil

P1:
alloc_bulk(c0)
    llist_del_first(&c0->waiting_for_gp_ttrace)
        entry = A
        next = B

P2: __free_rcu
    // A/B/C is freed back to slab
    // c0->waiting_for_gp_ttrace->first = NULL
    free_all(llist_del_all(c0))

P3: alloc_bulk(c1)
    // got A
    __kmalloc()

// free A (from c1), ..., last free X (allocated from c0)
P3: unit_free(c1)
    // the last freed element X is from c0
    c1->tgt = c0
    c1->free_llist->first -> X -> Y -> ... -> A

P3: free_bulk(c1)
    enque_to_free(c0)
        c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
    __llist_add_batch(c0->waiting_for_gp_ttrace)
        c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
In theory that's possible, but for this to happen one cpu needs to be a
thousand times slower than all the others, and since there is no
preemption inside llist_del_first() I don't think we need to worry about it.
Also, with the removal of the _tail optimization the above
llist_add_batch(waiting_for_gp_ttrace)
will become a loop, so the reused element will end up at the very end
instead of at the top; then one cpu would have to be a million times
slower, which is not realistic.
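For illustration, the difference would be roughly the following (a
hypothetical sketch; the real call site is do_call_rcu_ttrace() in
kernel/bpf/memalloc.c, and llhead/lltail just stand for the batch
head/tail pointers, so the exact code may differ):

/* with the _tail optimization: one batch-add, the old head (A) stays
 * at the front of waiting_for_gp_ttrace
 */
__llist_add_batch(llhead, lltail, &c->waiting_for_gp_ttrace);

/* without it: elements are pushed one by one, so the old head (A) is
 * added first and ends up at the tail of waiting_for_gp_ttrace
 */
llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
	llist_add(llnode, &c->waiting_for_gp_ttrace);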
P1:
    // A is added back as first again,
    // but llist_del_first() didn't notice
    try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
    // c0->waiting_for_gp_ttrace is corrupted
    c0->waiting_for_gp_ttrace->first = B
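To make the interleaving concrete, here is a self-contained user-space
sketch of the same ABA sequence (C11 atomics stand in for the kernel's
xchg/try_cmpxchg; the nodes A/B/C/X/Y and the steps mirror the scenario
above, none of this is kernel code):

#include <stdatomic.h>
#include <stdio.h>

struct node { struct node *next; };
struct head { _Atomic(struct node *) first; };

int main(void)
{
	struct node A, B, C, X, Y;
	struct head gp;		/* stands in for c0->waiting_for_gp_ttrace */

	/* initial list: A -> B -> C -> nil */
	C.next = NULL; B.next = &C; A.next = &B;
	atomic_store(&gp.first, &A);

	/* P1: llist_del_first() samples entry = A, next = B, then stalls
	 * right before its try_cmpxchg()
	 */
	struct node *entry = atomic_load(&gp.first);
	struct node *next = entry->next;

	/* P2: free_all(llist_del_all(c0)) empties the list and A/B/C go
	 * back to slab for reuse
	 */
	atomic_store(&gp.first, NULL);

	/* P3: A is reallocated, freed into c1 with c1->tgt = c0, and
	 * re-added at the head of c0->waiting_for_gp_ttrace as
	 * A -> ... -> Y -> X
	 */
	X.next = NULL; Y.next = &X; A.next = &Y;
	atomic_store(&gp.first, &A);

	/* P1 resumes: first is still A, so the CAS succeeds and installs
	 * the stale B, silently dropping Y and X
	 */
	int ok = atomic_compare_exchange_strong(&gp.first, &entry, next);

	printf("cmpxchg ok=%d, first==B: %d (list corrupted)\n",
	       ok, atomic_load(&gp.first) == &B);
	return 0;
}

The CAS only compares the pointer value, so the re-added A is
indistinguishable from the original one; that is the classic ABA pattern,
and presumably what the llist.h comment about needing a lock when mixing
llist_del_first() with other del_first/del_all consumers is about.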