Hi, On 11/4/2023 12:49 AM, Yonghong Song wrote: > > On 11/2/23 11:54 PM, Hou Tao wrote: >> Hi, >> >> On 11/3/2023 12:08 AM, Yonghong Song wrote: >>> On 11/2/23 6:40 AM, Hou Tao wrote: >>>> Hi Alexei, >>>> >>>> On 10/31/2023 4:01 PM, Hou Tao wrote: >>>>> Hi, >>>>> >>>>> On 10/30/2023 10:11 PM, kernel test robot wrote: >>>>>> hi, Hou Tao, >>>>>> >>>>>> we noticed a WARN_ONCE added in this commit was hit in our tests. >>>>>> FYI. >>>>>> >>>>>> SNIP >>>> I see what has happened. The problem is twofold: >>>> (1) The object_size of kmalloc-cg-96 is adjust from 96 to 128 due to >>>> slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is >>>> enabled by default for kmalloc slab, so align is 64 and size is 128 >>>> for >>>> kmalloc-cg-96. So when unit_alloc() does kmalloc_node(96, >>>> __GFP_ACCOUNT, >>>> node), ksize() will return 128 instead of 96 for the returned pointer. >>>> SLUB has a similar merge logic, but because its align is 8 under >>>> x86-64, >>>> so the warning doesn't happen for i386 + SLUB, but I think the similar >>>> problem may exist for other architectures. >>>> (2) kmalloc_size_roundup() returns the object_size of kmalloc-96 >>>> instead >>>> of kmalloc-cg-96, so bpf_mem_cache_adjust_size() doesn't adjust >>>> size_index accordingly. The reason why the object_size of >>>> kmalloc-96 is >>>> 96 instead of 128 is that there is slab merge for kmalloc-96. >>>> >>>> About how to fix the problem, I have two ideas: >>>> The first is to introduce kmalloc_size_roundup_flags(), so >>>> bpf_mem_cache_adjust_size() could use kmalloc_size_roundup_flags(size, >>>> __GFP_ACCOUNT) to get the object_size of kmalloc-cg-xxx. It could fix >>>> the warning for now, but the warning may pop-up occasionally due to >>>> SLUB >>>> merge and unusual slab align. The second is just using the >>>> bpf_mem_cache >>>> pointer to get the unit_size which is saved before the to-be-free >>>> pointer. Its downside is that it may can not be able to skip the free >>>> operation for pointer which is not allocated from bpf ma, but I >>>> think it >>>> is acceptable. I prefer the latter solution. What do you think ? >>> >>> Is it possible that in bpf_mem_cache_adjust_size(), we do a series of >>> kmalloc (for supported bucket size) and call ksize() to get the actual >>> allocated object size. So eventually all possible allocated object >>> sizes >>> will be used for size_index[]. This will avoid all kind of special >>> corner cases due to config/macro/arch etc. WDYT? >> It is basically the same as the first proposed solution and it has the >> same flaw. The problem is that slab merge can happen in any time, so the >> return value of ksize() may change even all passed pointers are >> allocated from the same slab. Considering the following case: during the >> invocation of bpf_mem_cache_adjust_size() or the initialization of >> bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object >> returns 96. But after these invocations, a new slab created by a kernel >> module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96 >> is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB, >> because it is alignment requirement is 64 for 96-bytes slab). So soon or > > So, the object_size for allocated objects in that is adjusted from 96 > to 128 > while previously allocated objects should have no change, it is merely > ksize(old_obj) > previous return 96, now returns 128, right? Okay, so this is indeed a > problem > since we use ksize() to decide the bucket. Yes. The object_size of underlying slab changes, so the return value of ksize() will change as well. > > >> later, when bpf_global_ma frees a 96-byte-sized pointer which is >> allocated from a bpf_mem_cache in which unit_size is 96, bpf_mem_free() >> will free the pointer through a bpf_mem_cache in which unit_size is 128, >> because the return value of ksize() changes. Maybe we should introduce a >> new API in mm which returns size instead of object_size of underlying >> slab, so the return value will not change due to slab merge. > > In this case, to avoid the warning, indeed we need to use '96' instead > of '128'. > So use the original ksize() return value is indeed a solution. > We could use the mechanism similar to percpu alloc to save '96' in the > memory. We have already saved the pointer of bpf_mem_cache in the extra space (aka LLIST_NODE_SZ) which is allocated together with the returned pointer, so I think we could use bpf_mem_cache->unit_size to get the size of the free pointer directly. I will check whether or not there is performance degradation before posting the patch. Regards, Tao