Hi,

On 2/27/2025 10:07 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 5:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 2/26/2025 12:15 AM, Brandon Kammerdiener wrote:
>>> On Tue, Feb 25, 2025 at 08:26:17PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 2/25/2025 3:13 PM, Martin KaFai Lau wrote:
>>>>> On 2/24/25 2:01 PM, Brandon Kammerdiener wrote:
>>>>>> This patch fixes an endless loop condition that can occur in
>>>>>> bpf_for_each_hash_elem, causing the core to softlock. My
>>>>>> understanding is that a combination of RCU list deletion and
>>>>>> insertion introduces the new element after the iteration cursor
>>>>>> and that there is a chance that an RCU
>>>>> new element is added to the head of the bucket, so the first thought
>>>>> is it should not extend the list beyond the current iteration point...
>>>>>
>>>>>> reader may in fact use this new element in iteration. The patch uses a
>>>>>> _safe variant of the macro which gets the next element to iterate before
>>>>>> executing the loop body for the current element. The following simple
>>>>>> BPF program can be used to reproduce the issue:
>>>>>>
>>>>>> #include "vmlinux.h"
>>>>>> #include <bpf/bpf_helpers.h>
>>>>>> #include <bpf/bpf_tracing.h>
>>>>>>
>>>>>> #define N (64)
>>>>>>
>>>>>> struct {
>>>>>>         __uint(type, BPF_MAP_TYPE_HASH);
>>>>>>         __uint(max_entries, N);
>>>>>>         __type(key, __u64);
>>>>>>         __type(value, __u64);
>>>>>> } map SEC(".maps");
>>>>>>
>>>>>> static int cb(struct bpf_map *map, __u64 *key, __u64 *value,
>>>>>>               void *arg) {
>>>>>>         bpf_map_delete_elem(map, key);
>>>>>>         bpf_map_update_elem(map, &key, &val, 0);
>>>>> I suspect what happened in this reproducer is,
>>>>> there is a bucket with more than one elem and the deleted elem gets
>>>>> immediately added back to the bucket->head.
>>>>> Something like this, '[ ]' as the current elem:
>>>>>
>>>>> 1st iteration (follow bucket->head.first): [elem_1] -> elem_2
>>>>>         delete_elem: elem_2
>>>>>         update_elem: [elem_1] -> elem_2
>>>>> 2nd iteration (follow elem_1->hash_node.next): elem_1 -> [elem_2]
>>>>>         delete_elem: elem_1
>>>>>         update_elem: [elem_2] -> elem_1
>>>>> 3rd iteration (follow elem_2->hash_node.next): elem_2 -> [elem_1]
>>>>> loop.......
>>>>>
>>>> Yes. The above procedure is exactly what the test case tries to do
>>>> (except the &key and &val typos).
>>> Yes, apologies for the typos I must have introduced when minifying the
>>> example. Should just use key and val sans the &.
>> OK
>>>>> I don't think "_safe" covers all cases though. "_safe" may solve this
>>>>> particular reproducer which is shooting itself in the foot by deleting
>>>>> and adding itself when iterating a bucket.
>>>> Yes, if the bpf prog could delete and re-add the saved next entry, there
>>>> will be a dead loop as well. It seems __htab_map_lookup_elem() may suffer
>>>> from the same problem, just as bpf_for_each_hash_elem() does. The problem
>>>> is mainly due to the immediate reuse of the deleted element. Maybe we
>>>> need to add a seqlock to the htab_elem and retry the traversal if the
>>>> seqlock is not stable.
>> The seqlock + restart traversal way doesn't work, because the seq-count
>> of each element will be increased by 2 after the re-insert, so the loop
>> will always try to restart the traversal. I have another idea: how about
>> adding a per-bucket incremental id to each element in the bucket? During
>> the traversal, any element with an id greater than the id of the current
>> element is ignored, so the newly-added element will be skipped.
>> For example, there are three elements in a bucket list:
>>
>>     head -> A[id=3] -> B[id=2] -> C[id=1]
>>
>> (1) pass A to cb
>>     current id = 3
>>     cb deletes A and inserts A again:
>>     head -> A[4] -> B[2] -> C[1]
>>
>> (2) pass B to cb
>>     current id = 2
>>     cb deletes B and inserts B again:
>>     head -> B[5] -> A[4] -> C[1]
>>
>> The id of A is greater than the current id, so it is skipped.

> This looks like unnecessary overhead that attempts
> to reinvent nulls logic.

OK.

> At present I'm not convinced that lookup_nulls_elem_raw() is broken.
> The only issue is with bpf_for_each_hash_elem(), which loops forever
> in this convoluted case, and _safe() is the right fix for it.

For lookup_nulls_elem_raw(), I think a dead loop in the lookup depends on
the exact synchronization between the lookup procedure and the
update/deletion procedure, so it will be hard to reproduce.
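
For reference, the nulls logic that already protects the lookup path works
roughly as follows: under RCU, a deleted element can be reused and re-hashed
into a different bucket while a reader is still walking the list, so the
bucket list ends in a per-bucket "nulls" marker instead of a plain NULL, and
the lookup restarts when it finishes on a marker that does not match the
bucket it started from. A sketch, paraphrased from memory of
lookup_nulls_elem_raw() in kernel/bpf/hashtab.c (the exact code may differ):

/* Paraphrase of lookup_nulls_elem_raw(); called under rcu_read_lock(),
 * so it may observe elements moving between buckets while it walks.
 */
static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
                                               u32 hash, void *key,
                                               u32 key_size, u32 n_buckets)
{
        struct hlist_nulls_node *n;
        struct htab_elem *l;

again:
        hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
                if (l->hash == hash && !memcmp(&l->key, key, key_size))
                        return l;

        /* The walk ended on a nulls marker. If it is not the marker of
         * the bucket the walk started from, the element under the cursor
         * was moved to another bucket mid-walk, so restart.
         */
        if (get_nulls_value(n) != (hash & (n_buckets - 1)))
                goto again;

        return NULL;
}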
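
And to make the _safe() fix for bpf_for_each_hash_elem() concrete, here is a
minimal standalone user-space sketch of Martin's trace above. The elem, del,
add_head and cb names are illustrative stand-ins for the real htab machinery,
not the actual kernel code:

#include <stdio.h>

struct elem {
        int key;
        struct elem *next;
};

static struct elem *head;

static void del(struct elem *e)
{
        struct elem **pp = &head;

        while (*pp && *pp != e)
                pp = &(*pp)->next;
        if (*pp)
                *pp = e->next;
}

static void add_head(struct elem *e)
{
        e->next = head;
        head = e;
}

/* what the reproducer's cb does: delete + immediately re-add at head */
static void cb(struct elem *e)
{
        del(e);
        add_head(e);
}

int main(void)
{
        struct elem a = { .key = 1 }, b = { .key = 2 };
        struct elem *e, *n;
        int steps = 0;

        add_head(&b);
        add_head(&a);           /* bucket: a -> b */

        /* Unsafe walk: reads e->next *after* cb() has moved e back to
         * the head, so the cursor bounces between a and b forever;
         * capped at 10 steps so the demo terminates.
         */
        for (e = head; e && steps < 10; e = e->next) {
                cb(e);
                steps++;
        }
        printf("unsafe walk: %d steps and still going\n", steps);

        /* _safe walk: cache the successor before running the callback,
         * so the re-inserted current element cannot redirect the cursor.
         */
        steps = 0;
        for (e = head; e; e = n) {
                n = e->next;
                cb(e);
                steps++;
        }
        printf("safe walk: %d steps\n", steps);
        return 0;
}

The unsafe walk keeps chasing the next pointer of an element that cb() has
just moved back to the head, while the safe walk reads the successor first,
so this particular reproducer terminates. As noted above, the _safe pattern
can still loop if cb() deletes and re-adds the *saved next* element.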