Hi,

On 11/8/2023 12:04 AM, Yonghong Song wrote:
>
> On 11/7/23 2:43 AM, Hou Tao wrote:
>> Hi,
>>
>> On 11/4/2023 12:49 AM, Yonghong Song wrote:
>>> On 11/2/23 11:54 PM, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 11/3/2023 12:08 AM, Yonghong Song wrote:
>>>>> On 11/2/23 6:40 AM, Hou Tao wrote:
>>>>>> Hi Alexei,
>>>>>>
>>>>>> On 10/31/2023 4:01 PM, Hou Tao wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 10/30/2023 10:11 PM, kernel test robot wrote:
>>>>>>>> hi, Hou Tao,
>>>>>>>>
>>>>>>>> we noticed a WARN_ONCE added in this commit was hit in our tests.
>>>>>>>> FYI.
>>>>>>>>
>>>>>> SNIP
>>>>>> I see what has happened. The problem is twofold:
>>>>>> (1) The object_size of kmalloc-cg-96 is adjusted from 96 to 128 due
>>>>>> to slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN
>>>>>> is enabled by default for kmalloc slabs, so the align is 64 and the
>>>>>> size is 128 for kmalloc-cg-96. So when unit_alloc() does
>>>>>> kmalloc_node(96, __GFP_ACCOUNT, node), ksize() will return 128
>>>>>> instead of 96 for the returned pointer. SLUB has similar merge
>>>>>> logic, but because its align is 8 under x86-64, the warning doesn't
>>>>>> happen for i386 + SLUB. However, I think a similar problem may exist
>>>>>> for other architectures.
>>>>>> (2) kmalloc_size_roundup() returns the object_size of kmalloc-96
>>>>>> instead of kmalloc-cg-96, so bpf_mem_cache_adjust_size() doesn't
>>>>>> adjust size_index accordingly. The reason why the object_size of
>>>>>> kmalloc-96 is 96 instead of 128 is that there is no slab merge for
>>>>>> kmalloc-96.
>>>>>>
>>>>>> About how to fix the problem, I have two ideas:
>>>>>> The first is to introduce kmalloc_size_roundup_flags(), so
>>>>>> bpf_mem_cache_adjust_size() could use
>>>>>> kmalloc_size_roundup_flags(size, __GFP_ACCOUNT) to get the
>>>>>> object_size of kmalloc-cg-xxx.
>>>>>> It could fix the warning for now, but the warning may pop up
>>>>>> occasionally due to SLUB merge and unusual slab alignment. The
>>>>>> second is to just use the bpf_mem_cache pointer to get the
>>>>>> unit_size, which is saved before the to-be-freed pointer. Its
>>>>>> downside is that it may not be able to skip the free operation for a
>>>>>> pointer which is not allocated from bpf ma, but I think that is
>>>>>> acceptable. I prefer the latter solution. What do you think?
>>>>> Is it possible that in bpf_mem_cache_adjust_size(), we do a series of
>>>>> kmalloc (for supported bucket sizes) and call ksize() to get the
>>>>> actual allocated object size? So eventually all possible allocated
>>>>> object sizes will be used for size_index[]. This will avoid all kinds
>>>>> of special corner cases due to config/macro/arch etc. WDYT?
>>>> It is basically the same as the first proposed solution and it has the
>>>> same flaw. The problem is that slab merge can happen at any time, so
>>>> the return value of ksize() may change even though all passed pointers
>>>> are allocated from the same slab. Consider the following case: during
>>>> the invocation of bpf_mem_cache_adjust_size() or the initialization of
>>>> bpf_global_ma, there is no slab merge and ksize() for a 96-byte object
>>>> returns 96. But after these invocations, a new slab created by a
>>>> kernel module is merged into kmalloc-cg-96 and the object_size of
>>>> kmalloc-cg-96 is adjusted from 96 to 128 (which is possible for x86-64
>>>> + CONFIG_SLAB, because the alignment requirement is 64 for a 96-byte
>>>> slab). So sooner or
>>> So, the object_size for allocated objects in that slab is adjusted from
>>> 96 to 128, while previously allocated objects have no change; it is
>>> merely that ksize(old_obj) previously returned 96 and now returns 128,
>>> right? Okay, so this is indeed a problem since we use ksize() to decide
>>> the bucket.
>> Yes.
>> The object_size of the underlying slab changes, so the return value of
>> ksize() will change as well.
>>>
>>>> later, when bpf_global_ma frees a 96-byte-sized pointer which was
>>>> allocated from a bpf_mem_cache in which unit_size is 96,
>>>> bpf_mem_free() will free the pointer through a bpf_mem_cache in which
>>>> unit_size is 128, because the return value of ksize() has changed.
>>>> Maybe we should introduce a new API in mm which returns the size
>>>> instead of the object_size of the underlying slab, so the return value
>>>> will not change due to slab merge.
>>> In this case, to avoid the warning, indeed we need to use '96' instead
>>> of '128'. So using the original ksize() return value is indeed a
>>> solution. We could use a mechanism similar to percpu alloc to save
>>> '96' in the memory.
>> We have already saved the pointer of the bpf_mem_cache in the extra
>> space (aka LLIST_NODE_SZ) which is allocated together with the returned
>> pointer, so I think we could use bpf_mem_cache->unit_size to get the
>> size of the freed pointer directly. I will check whether or not there
>> is a performance degradation before posting the patch.

Posting for a status update: there is no benchmark for bpf_mem_free(), so
I am still writing a benchmark for bpf_mem_alloc() and bpf_mem_free().
After the benchmark is ready (I think it will be next week), I will post
the fix patch.
> See:
>
>   bpf_local_storage.c:  err = bpf_mem_alloc_init(&smap->selem_ma, smap->elem_size, false);
>   bpf_local_storage.c:  err = bpf_mem_alloc_init(&smap->storage_ma, sizeof(struct bpf_local_storage), false);
>   core.c:               ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
>   core.c:               ret = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>   cpumask.c:            ret = bpf_mem_alloc_init(&bpf_cpumask_ma, sizeof(struct bpf_cpumask), false);
>   hashtab.c:            err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
>   hashtab.c:            err = bpf_mem_alloc_init(&htab->pcpu_ma,
>   memalloc.c:           int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>
> Some 'size' parameters in core.c are zero. Not sure how exactly you will
> resolve this issue based on bpf_mem_cache->unit_size, but looking
> forward to your patch!
>
>> Regards,
>> Tao