On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@xxxxxxxx> wrote:
>
>
>
> On 1/3/23 10:30 PM, Hou Tao wrote:
> > Hi,
> >
> > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> >>
> >>
> >> On 1/3/23 5:47 AM, Hou Tao wrote:
> >>> Hi,
> >>>
> >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> >>>>
> >>>>
> >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> The patchset tries to fix the problems found when checking how htab map
> >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> >>>>>> freed elements may lead to two problems in htab map:
> >>>>>>
> >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> >>>>>>     htab map value and it may corrupt lookup procedure with BPF_F_LOCK
> >>>>>>     flag which acquires bpf-spin-lock during value copying. The
> >>>>>>     corruption of bpf-spin-lock may result in hard lock-up.
> >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> >>>>>>     freed and then reused.
> >>>>>>
> >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> >>>>>> these special fields in map element only when the map element is newly
> >>>>>> allocated. If it is just a reused element, there will be no
> >>>>>> reinitialization.
> >>>>>
> >>>>> Instead of adding the overhead of ctor callback let's just
> >>>>> add __GFP_ZERO to flags in __alloc().
> >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> >>>>> will behave the same way.
> >>>>
> >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@xxxxxxxxx/
> >>>> tried to address a similar issue for lru hash table.
> >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> >>>> hash table?
> >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> >>
> >> The following is my understanding:
> >> in function alloc_htab_elem() (hashtab.c), we have
> >>
> >>     if (is_map_full(htab))
> >>         if (!old_elem)
> >>             /* when map is full and update() is replacing
> >>              * old element, it's ok to allocate, since
> >>              * old element will be freed immediately.
> >>              * Otherwise return an error
> >>              */
> >>             return ERR_PTR(-E2BIG);
> >>     inc_elem_count(htab);
> >>     l_new = bpf_mem_cache_alloc(&htab->ma);
> >>     if (!l_new) {
> >>         l_new = ERR_PTR(-ENOMEM);
> >>         goto dec_count;
> >>     }
> >>     check_and_init_map_value(&htab->map,
> >>                              l_new->key + round_up(key_size, 8));
> >>
> >> In the above check_and_init_map_value() intends to do initializing
> >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> >>
> >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> >>
> >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> >> {
> >>     int i;
> >>
> >>     if (!foffs)
> >>         return;
> >>     for (i = 0; i < foffs->cnt; i++)
> >>         memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> >> }
> >>
> >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> >> {
> >>     bpf_obj_init(map->field_offs, dst);
> >> }
> >>
> >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> >> list_head, list_node, etc.
> >>
> >> This is the problem for above problem #1.
> >> Maybe I missed something?
> > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > problem by only calling check_and_init_map_value() once for the newly-allocated
> > element, so if a freed element is reused, its special fields will not be zeroed
> > again. Is there any other cases which are not covered by the solution or any
> > other similar problems in hash-tab ?
>
> No, I checked all cases of check_and_init_map_value() and didn't find
> any other instances.

check_and_init_map_value() is called in two other cases:
lookup_and_delete[_batch].
There the zeroing of the fields is necessary because the 'value' is
a temp buffer that is going to be copied to user space.

I think the way forward is to:
- add __GFP_ZERO to bpf_mem_alloc (to make it equivalent to prealloc),
- remove the one call to check_and_init_map_value() in hashmap that
  runs on a possibly reused element,
- add short comments to the two other cases,
- add a big comment to check_and_init_map_value() saying that 'dst'
  must be a temp buffer and must not point to memory that could be
  used in parallel by a bpf prog.

It feels like we've dealt with this issue a couple of times already
and keep repeating this mistake, so the more comments the better.