Re: [PATCH bpf-next v5 3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Thu, 21 Dec 2023 16:42:15 +0800



Hi,

On 12/21/2023 3:52 PM, Yonghong Song wrote:
>
> On 12/20/23 11:16 PM, Yonghong Song wrote:
>>
>> On 12/20/23 10:26 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/21/2023 1:00 PM, Yonghong Song wrote:
>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>>> allocation")
>>>> added support for non-fix-size percpu memory allocation.
>>>> Such allocation will allocate percpu memory for all buckets on all
>>>> cpus and the memory consumption is in the order to quadratic.
>>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256
>>>> bytes.
>>>> Then let us say, 8 cpus with the same unit size, each cpu
>>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024
>>>> bytes.
>>>> So if the number of cpus doubles, the number of memory consumption
>>>> will be 4 times. So for a system with large number of cpus, the
>>>> memory consumption goes up quickly with quadratic order.
>>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>>>>
>>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>>>> done in boot time, so for system with large number of cpus, the
>>>> initial
>>>> percpu memory consumption is very visible. For example, for 128 cpu
>>>> system, the total percpu memory allocation will be at least
>>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
>>>>    * 128 * 128 = ~138MB.
>>>> which is pretty big. It will be even bigger for larger number of cpus.
>>> SNIP
>>>> +
>>>>   static void drain_mem_cache(struct bpf_mem_cache *c)
>>>>   {
>>>>       bool percpu = !!c->percpu_size;
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index f13008d27f35..08f9a49cc11c 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct
>>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>>                   if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>>>>                       return -ENOMEM;
>>>>   -                if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>> -                    if (!bpf_global_percpu_ma_set) {
>>>> -                        mutex_lock(&bpf_percpu_ma_lock);
>>>> -                        if (!bpf_global_percpu_ma_set) {
>>>> -                            err =
>>>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>>>> -                            if (!err)
>>>> -                                bpf_global_percpu_ma_set = true;
>>>> -                        }
>>>> - mutex_unlock(&bpf_percpu_ma_lock);
>>>> -                        if (err)
>>>> -                            return err;
>>>> -                    }
>>>> -                }
>>>> -
>>>>                   if (((u64)(u32)meta.arg_constant.value) !=
>>>> meta.arg_constant.value) {
>>>>                       verbose(env, "local type ID argument must be
>>>> in range [0, U32_MAX]\n");
>>>>                       return -EINVAL;
>>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct
>>>> bpf_verifier_env *env, struct bpf_insn *insn,
>>>>                       return -EINVAL;
>>>>                   }
>>>>   +                if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>> +                    if (!bpf_global_percpu_ma_set) {
>>>> +                        mutex_lock(&bpf_percpu_ma_lock);
>>>> +                        if (!bpf_global_percpu_ma_set) {
>>>> +                            err =
>>>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>>> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I
>>> think the memory account will be incorrect, right ? Maybe we should
>>> pass
>>> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think
>>> the objcg is NULL.
>>
>> You are correct. Calling bpf_mem_alloc_percpu_init() in init stage
>> is exactly the reason to have proper root memcg for objcg. Sorry I
>> missed it.
>>
>> I remembered I indeed traced it a few days ago and indeed it is NULL.
>> There are three ways to resolve this:
>>    1 Just do 'ma->objcg = NULL' unconditionally in
>> bpf_mem_alloc_percpu_init().
>>    2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at
>> init stage,
>>      e.g., in bpf_global_ma_init() init function (core.c), and later
>> it can
>>      be used in bpf_mem_alloc_percpu_init().
>>    3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize
>> ma->objcg
>>      properly. But delay __alloc_percpu_gfp() later when verifier
>> found a call
>>      to bpf_percpu_obj_new(). We could add a call
>> bpf_mem_alloc_percpu_init_caches()
>>      to do __alloc_percpu_grp().
>>
>> I prefer option 3, what do you think?
>
> The option 4 below:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 984c83ecace9..f90989cc9cbc 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -122,6 +122,7 @@ struct bpf_mem_caches {
>  };
>  
>  static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256,
> 512, 1024, 2048, 4096};
> +static struct obj_cgroup *objcg_at_init __ro_after_init;
>  
>  static struct llist_node notrace *__llist_del_first(struct llist_head
> *head)
>  {
> @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc
> *ma)
>         ma->percpu = true;
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -       ma->objcg = get_obj_cgroup_from_current();
> +       ma->objcg = objcg_at_init;
>  #else
>         ma->objcg = NULL;
>  #endif
> @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct
> bpf_mem_alloc *ma, gfp_t flags)
>  
>         return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init find_objcg_at_init(void)
> +{
> +       objcg_at_init = get_obj_cgroup_from_current();
> +       return 0;
> +}
> +late_initcall(find_objcg_at_init);
>
> It seems this is better?

It seems that you are worried about the objcg of root memcg may change
from NULL to something else one day, right ? If it is the case, I think
both option 3 and 4 are fine. But I still think passing the desired
objcg to bpf_mem_alloc_percpu_init() directly is better. If the objcg of
root memcg is not NULL afterwards, we can update the passed parameter
accordingly.
>
>>
>>>> +                            if (!err)
>>>> +                                bpf_global_percpu_ma_set = true;
>>>> +                        }
>>>> + mutex_unlock(&bpf_percpu_ma_lock);
>>>> +                        if (err)
>>>> +                            return err;
>>>> +                    }
>>>> +
>>>> +                    mutex_lock(&bpf_percpu_ma_lock);
>>>> +                    err =
>>>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
>>>> +                    mutex_unlock(&bpf_percpu_ma_lock);
>>>> +                    if (err)
>>>> +                        return err;
>>>> +                }
>>>> +
>>>>                   struct_meta = btf_find_struct_meta(ret_btf,
>>>> ret_btf_id);
>>>>                   if (meta.func_id ==
>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>>                       if (!__btf_type_is_scalar_struct(env,
>>>> ret_btf, ret_t, 0)) {
>>>
>>