On Wed, Sep 02, 2020 at 07:44:34PM -0700, Yonghong Song wrote:
>
>
> On 9/2/20 6:25 PM, Andrii Nakryiko wrote:
> > On Wed, Sep 2, 2020 at 4:56 PM Yonghong Song <yhs@xxxxxx> wrote:
> > >
> > > Currently, for hashmap, the bpf iterator will grab a bucket lock, a
> > > spinlock, before traversing the elements in the bucket. This can ensure
> > > all bpf visited elements are valid. But this mechanism may cause
> > > deadlock if update/deletion happens to the same bucket of the
> > > visited map in the program. For example, if we added bpf_map_update_elem()
> > > call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
> > > we will have the following deadlock:
> > >
> > [...]
> > >
> > > Compared to old bucket_lock mechanism, if concurrent update/delete happens,
> > > we may visit stale elements, miss some elements, or repeat some elements.
> > > I think this is a reasonable compromise. For users wanting to avoid
> >
> > I agree, the only reliable way to iterate map without duplicates and
> > missed elements is to not update that map during iteration (unless we
> > start supporting point-in-time snapshots, which is a very different
> > matter).
> >
> > > stale, missing/repeated accesses, bpf_map batch access syscall interface
> > > can be used.
> > >
> > > Signed-off-by: Yonghong Song <yhs@xxxxxx>
> > > ---
> > >  kernel/bpf/hashtab.c | 15 ++++-----------
> > >  1 file changed, 4 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > index 78dfff6a501b..7df28a45c66b 100644
> > > --- a/kernel/bpf/hashtab.c
> > > +++ b/kernel/bpf/hashtab.c
> > > @@ -1622,7 +1622,6 @@ struct bpf_iter_seq_hash_map_info {
> > >         struct bpf_map *map;
> > >         struct bpf_htab *htab;
> > >         void *percpu_value_buf; // non-zero means percpu hash
> > > -       unsigned long flags;
> > >         u32 bucket_id;
> > >         u32 skip_elems;
> > >  };
> > > @@ -1632,7 +1631,6 @@ bpf_hash_map_seq_find_next(struct bpf_iter_seq_hash_map_info *info,
> > >                                            struct htab_elem *prev_elem)
> > >  {
> > >         const struct bpf_htab *htab = info->htab;
> > > -       unsigned long flags = info->flags;
> > >         u32 skip_elems = info->skip_elems;
> > >         u32 bucket_id = info->bucket_id;
> > >         struct hlist_nulls_head *head;
> > > @@ -1656,19 +1654,18 @@ bpf_hash_map_seq_find_next(struct bpf_iter_seq_hash_map_info *info,
> > >
> > >                 /* not found, unlock and go to the next bucket */
> > >                 b = &htab->buckets[bucket_id++];
> > > -               htab_unlock_bucket(htab, b, flags);
> > > +               rcu_read_unlock();
> >
> > Just double checking as I don't yet completely understand all the
> > sleepable BPF implications. If the map is used from a sleepable BPF
> > program, we are still ok doing just rcu_read_lock/rcu_read_unlock when
> > accessing BPF map elements, right? No need for extra
> > rcu_read_lock_trace/rcu_read_unlock_trace?
>
> I think it is fine now since currently bpf_iter program cannot be sleepable
> and the current sleepable program framework already allows the following
> scenario.
>   - map1 is a preallocated hashmap shared by two programs,
>     prog1_nosleep and prog2_sleepable
>
>      ...                                  ...
>      rcu_read_lock()                      rcu_read_lock_trace()
>      run prog1_nosleep                    run prog2_sleepable
>        lookup/update/delete map1 elem       lookup/update/delete map1 elem
>      rcu_read_unlock()                    rcu_read_unlock_trace()
>      ...                                  ...

rcu_trace doesn't protect the map. It protects the program.
Even for prog2_sleepable the map is protected by rcu.
The whole map, including all elements, will be freed after both sleepable
and non-sleepable progs stop executing.
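
For context, after this patch the iterator's bucket walk boils down to
roughly the sketch below. This is a simplified paraphrase of the patched
code, not the exact kernel function, and the function name is only
illustrative; the point is that rcu_read_lock() is the only protection held
while walking a bucket, with bucket_id/skip_elems used to resume between
seq_file reads:

/* Simplified sketch, paraphrased from the patched iterator in
 * kernel/bpf/hashtab.c (not the exact code).
 */
static struct htab_elem *
hash_map_iter_next(struct bpf_iter_seq_hash_map_info *info)
{
	const struct bpf_htab *htab = info->htab;
	struct hlist_nulls_head *head;
	struct hlist_nulls_node *n;
	struct htab_elem *elem;
	u32 i, count;

	for (i = info->bucket_id; i < htab->n_buckets; i++) {
		rcu_read_lock();
		count = 0;
		head = &htab->buckets[i].head;
		hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
			/* skip elements already returned from this bucket */
			if (count >= info->skip_elems) {
				info->bucket_id = i;
				info->skip_elems = count + 1;
				/* returned while still under rcu_read_lock();
				 * the caller drops it after showing the elem
				 */
				return elem;
			}
			count++;
		}
		rcu_read_unlock();
		info->skip_elems = 0;
	}
	return NULL;
}

Dropping the bucket spinlock is what removes the deadlock against
update/delete from the program; the trade-off is the possible stale,
missed, or repeated elements described in the commit log.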
This rcu_read_lock is needed for non-preallocated hash maps, where
individual elements are RCU-protected; see free_htab_elem() doing
call_rcu(). When the combination of sleepable progs and non-prealloc
hashmaps is enabled, we will need to revisit this iterator assumption.
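
For completeness, the element free path being referred to looks roughly
like this (paraphrased and trimmed from kernel/bpf/hashtab.c of this era;
the fd-map handling is omitted):

static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
{
	if (htab_is_prealloc(htab)) {
		/* prealloc: the element goes back to the per-cpu freelist;
		 * its memory is not kfree'd while the map is alive, which is
		 * consistent with prealloc maps being the ones sleepable
		 * progs may use today.
		 */
		__pcpu_freelist_push(&htab->freelist, &l->fnode);
	} else {
		/* non-prealloc: the element is kfree'd only after a regular
		 * RCU grace period, which is what the iterator's
		 * rcu_read_lock() relies on.  If sleepable progs are ever
		 * allowed to use non-prealloc hash maps, this is the
		 * assumption that would need revisiting, as noted above.
		 */
		atomic_dec(&htab->count);
		l->htab = htab;
		call_rcu(&l->rcu, htab_elem_free_rcu);
	}
}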