On Wed, Feb 26, 2025 at 5:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2/26/2025 12:15 AM, Brandon Kammerdiener wrote:
> > On Tue, Feb 25, 2025 at 08:26:17PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 2/25/2025 3:13 PM, Martin KaFai Lau wrote:
> >>> On 2/24/25 2:01 PM, Brandon Kammerdiener wrote:
> >>>> This patch fixes an endless loop condition that can occur in
> >>>> bpf_for_each_hash_elem, causing the core to softlock. My understanding is
> >>>> that a combination of RCU list deletion and insertion introduces the new
> >>>> element after the iteration cursor and that there is a chance that an RCU
> >>> new element is added to the head of the bucket, so the first thought
> >>> is it should not extend the list beyond the current iteration point...
> >>>
> >>>> reader may in fact use this new element in iteration. The patch uses a
> >>>> _safe variant of the macro which gets the next element to iterate before
> >>>> executing the loop body for the current element. The following simple BPF
> >>>> program can be used to reproduce the issue:
> >>>>
> >>>> #include "vmlinux.h"
> >>>> #include <bpf/bpf_helpers.h>
> >>>> #include <bpf/bpf_tracing.h>
> >>>>
> >>>> #define N (64)
> >>>>
> >>>> struct {
> >>>>     __uint(type, BPF_MAP_TYPE_HASH);
> >>>>     __uint(max_entries, N);
> >>>>     __type(key, __u64);
> >>>>     __type(value, __u64);
> >>>> } map SEC(".maps");
> >>>>
> >>>> static int cb(struct bpf_map *map, __u64 *key, __u64 *value, void *arg) {
> >>>>     bpf_map_delete_elem(map, key);
> >>>>     bpf_map_update_elem(map, &key, &val, 0);
> >>> I suspect what happened in this reproducer is,
> >>> there is a bucket with more than one elem(s) and the deleted elem gets
> >>> immediately added back to the bucket->head.
> >>> Something like this, '[ ]' as the current elem.
> >>>
> >>> 1st iteration (follow bucket->head.first): [elem_1] -> elem_2
> >>>     delete_elem: elem_2
> >>>     update_elem: [elem_1] -> elem_2
> >>> 2nd iteration (follow elem_1->hash_node.next): elem_1 -> [elem_2]
> >>>     delete_elem: elem_1
> >>>     update_elem: [elem_2] -> elem_1
> >>> 3rd iteration (follow elem_2->hash_node.next): elem_2 -> [elem_1]
> >>> loop.......
> >>>
> >> Yes. The above procedure is exactly what the test case tries to do (except
> >> the &key and &val typos).
> > Yes, apologies for the typos I must have introduced when minifying the
> > example. Should just use key and val sans the &.
>
> OK
>
> >>> I don't think "_safe" covers all cases though. "_safe" may solve this
> >>> particular reproducer which is shooting itself in the foot by deleting
> >>> and adding itself when iterating a bucket.
> >> Yes, if the bpf prog could delete and re-add the saved next entry, there
> >> would be a dead loop as well. It seems __htab_map_lookup_elem() may suffer
> >> from the same problem just as bpf_for_each_hash_elem(). The problem is
> >> mainly due to the immediate reuse of the deleted element. Maybe we need to
> >> add a seqlock to the htab_elem and retry the traversal if the seqlock is
> >> not stable.
>
> The seqlock + restart-traversal approach doesn't work, because the seq-count
> of each element will be incremented by 2 after the re-insert, so the loop will
> always try to restart the traversal. I have another idea: how about adding a
> per-bucket incremental id to each element in the bucket; during the
> traversal, the walk would ignore any element with an id greater than the id
> of the current element, so it would skip the newly-added elements.
> For example, there are three elements in a bucket list:
> head -> A[id=3] -> B[id=2] -> C[id=1]
>
> (1) pass A to cb
>     current id = 3
>     cb deletes A and inserts A again
>     head -> A[4] -> B[2] -> C[1]
>
> (2) pass B to cb
>     current id = 2
>     cb deletes B and inserts B again
>     head -> B[5] -> A[4] -> C[1]
>
>     the id of A is greater than the current id, so it is skipped.

This looks like unnecessary overhead that attempts to reinvent the nulls
logic. At present I'm not convinced that lookup_nulls_elem_raw() is broken.
The only real issue is that bpf_for_each_hash_elem() loops forever in this
convoluted case, and _safe() is the right fix for it.