On Wed, Sep 2, 2020 at 4:56 PM Yonghong Song <yhs@xxxxxx> wrote:
>
> Currently, for hashmap, the bpf iterator will grab a bucket lock, a
> spinlock, before traversing the elements in the bucket. This can ensure
> all bpf visited elements are valid. But this mechanism may cause
> deadlock if update/deletion happens to the same bucket of the visited
> map in the program. For example, if we added a bpf_map_update_elem()
> call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
> we will have the following deadlock:
>
> [...]
>
> Compared to the old bucket_lock mechanism, if a concurrent
> update/delete happens, we may visit stale elements, miss some elements,
> or repeat some elements. I think this is a reasonable compromise. For
> users wanting to avoid

I agree, the only reliable way to iterate a map without duplicates and
missed elements is to not update that map during iteration (unless we
start supporting point-in-time snapshots, which is a very different
matter).

> stale, missing/repeated accesses, the bpf_map batch access syscall
> interface can be used.
>
> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> ---
>  kernel/bpf/hashtab.c | 15 ++++-----------
>  1 file changed, 4 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 78dfff6a501b..7df28a45c66b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1622,7 +1622,6 @@ struct bpf_iter_seq_hash_map_info {
>         struct bpf_map *map;
>         struct bpf_htab *htab;
>         void *percpu_value_buf; // non-zero means percpu hash
> -       unsigned long flags;
>         u32 bucket_id;
>         u32 skip_elems;
>  };
> @@ -1632,7 +1631,6 @@ bpf_hash_map_seq_find_next(struct bpf_iter_seq_hash_map_info *info,
>                            struct htab_elem *prev_elem)
>  {
>         const struct bpf_htab *htab = info->htab;
> -       unsigned long flags = info->flags;
>         u32 skip_elems = info->skip_elems;
>         u32 bucket_id = info->bucket_id;
>         struct hlist_nulls_head *head;
> @@ -1656,19 +1654,18 @@ bpf_hash_map_seq_find_next(struct bpf_iter_seq_hash_map_info *info,
>
>                 /* not found, unlock and go to the next bucket */
>                 b = &htab->buckets[bucket_id++];
> -               htab_unlock_bucket(htab, b, flags);
> +               rcu_read_unlock();

Just double-checking, as I don't yet completely understand all the
sleepable BPF implications. If the map is used from a sleepable BPF
program, are we still ok doing just rcu_read_lock()/rcu_read_unlock()
when accessing BPF map elements? No need for an extra
rcu_read_lock_trace()/rcu_read_unlock_trace()?

>                 skip_elems = 0;
>         }
> [...]
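
For readers following along, here is a minimal sketch of the program
shape the commit message describes: a map-element iterator that calls
bpf_map_update_elem() on the very map it is visiting. It is modeled on
the bpf_iter selftests, but the map name, key/value types, and update
logic below are made up for illustration; this is not the exact
selftest change.

/* Sketch only: with the old bucket-lock scheme, the
 * bpf_map_update_elem() below tries to re-take the bucket spinlock
 * the iterator already holds and deadlocks; with RCU-only iteration
 * it proceeds, at the cost of possibly stale, missed, or repeated
 * elements as discussed above.
 */
#include "bpf_iter.h"           /* as in the bpf_iter selftest progs */
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 100);
        __type(key, __u32);
        __type(value, __u64);
} hashmap1 SEC(".maps");

SEC("iter/bpf_map_elem")
int dump_bpf_hash_map(struct bpf_iter__bpf_map_elem *ctx)
{
        __u32 *key = ctx->key;
        __u64 *val = ctx->value;
        __u64 new_val;

        /* key/value are NULL on the final, post-iteration call */
        if (!key || !val)
                return 0;

        /* update the element currently being visited, which lives in
         * the same bucket the iterator is traversing
         */
        new_val = *val + 1;
        bpf_map_update_elem(&hashmap1, key, &new_val, BPF_ANY);

        return 0;
}

With the old locking, reading a pinned iterator link running a program
like this would hang on the bucket spinlock; with this patch it should
complete, modulo the stale/missed/repeated elements trade-off.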