Hi,

On 2/25/2025 3:13 PM, Martin KaFai Lau wrote:
> On 2/24/25 2:01 PM, Brandon Kammerdiener wrote:
>> This patch fixes an endless loop condition that can occur in
>> bpf_for_each_hash_elem, causing the core to softlock. My understanding
>> is that a combination of RCU list deletion and insertion introduces the
>> new element after the iteration cursor and that there is a chance that
>> an RCU
>
> new element is added to the head of the bucket, so the first thought
> is it should not extend the list beyond the current iteration point...
>
>> reader may in fact use this new element in iteration. The patch uses a
>> _safe variant of the macro which gets the next element to iterate before
>> executing the loop body for the current element. The following simple
>> BPF program can be used to reproduce the issue:
>>
>> #include "vmlinux.h"
>> #include <bpf/bpf_helpers.h>
>> #include <bpf/bpf_tracing.h>
>>
>> #define N (64)
>>
>> struct {
>>         __uint(type, BPF_MAP_TYPE_HASH);
>>         __uint(max_entries, N);
>>         __type(key, __u64);
>>         __type(value, __u64);
>> } map SEC(".maps");
>>
>> static int cb(struct bpf_map *map, __u64 *key, __u64 *value, void *arg) {
>>         bpf_map_delete_elem(map, key);
>>         bpf_map_update_elem(map, &key, &val, 0);
>
> I suspect what happened in this reproducer is,
> there is a bucket with more than one elem(s) and the deleted elem gets
> immediately added back to the bucket->head.
> Something like this, '[ ]' as the current elem.
>
> 1st iteration (follow bucket->head.first): [elem_1] -> elem_2
>     delete_elem: elem_2
>     update_elem: [elem_1] -> elem_2
> 2nd iteration (follow elem_1->hash_node.next): elem_1 -> [elem_2]
>     delete_elem: elem_1
>     update_elem: [elem_2] -> elem_1
> 3rd iteration (follow elem_2->hash_node.next): elem_2 -> [elem_1]
> loop.......

Yes. The above procedure is exactly what the test case tries to do
(except for the &key and &val typos).

> I don't think "_safe" covers all cases though. "_safe" may solve this
> particular reproducer which is shooting itself in the foot by deleting
> and adding itself when iterating a bucket.

Yes, if the bpf prog deletes and re-adds the saved next entry, there
will be a dead loop as well. It seems __htab_map_lookup_elem() may
suffer from the same problem as bpf_for_each_hash_elem() does. The
problem is mainly due to the immediate reuse of the deleted element.
Maybe we need to add a seqlock to htab_elem and retry the traversal
if the seqlock is not stable.

> [ btw, I don't think the test code can work as is. At least the "&key"
> arg of the bpf_map_update_elem looks wrong. ]
>
>>         return 0;
>> }
>>
>> SEC("uprobe//proc/self/exe:test")
>> int BPF_PROG(test) {
>>         __u64 i;
>>
>>         bpf_for(i, 0, N) {
>>                 bpf_map_update_elem(&map, &i, &i, 0);
>>         }
>>
>>         bpf_for_each_map_elem(&map, cb, NULL, 0);
>>
>>         return 0;
>> }
>>
>> char LICENSE[] SEC("license") = "GPL";
>>
>> Signed-off-by: Brandon Kammerdiener <brandon.kammerdiener@xxxxxxxxx>
>>
>> ---
>>   kernel/bpf/hashtab.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 4a9eeb7aef85..43574b0495c3 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -2224,7 +2224,7 @@ static long bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_
>>                 b = &htab->buckets[i];
>>                 rcu_read_lock();
>>                 head = &b->head;
>> -               hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
>> +               hlist_nulls_for_each_entry_safe(elem, n, head, hash_node) {
>>                         key = elem->key;
>>                         if (is_percpu) {
>>                                 /* current cpu value for percpu map */
>> --
>> 2.48.1