Hi,

On 12/21/2023 3:52 PM, Yonghong Song wrote:
>
> On 12/20/23 11:16 PM, Yonghong Song wrote:
>>
>> On 12/20/23 10:26 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/21/2023 1:00 PM, Yonghong Song wrote:
>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
>>>> added support for non-fix-size percpu memory allocation.
>>>> Such an allocation allocates percpu memory for all buckets on all
>>>> cpus, so the memory consumption is quadratic in the number of cpus.
>>>> For example, with 4 cpus and a unit size of 16 bytes, each cpu holds
>>>> 16 * 4 = 64 bytes, and the total across 4 cpus is 64 * 4 = 256 bytes.
>>>> With 8 cpus and the same unit size, each cpu holds 16 * 8 = 128 bytes,
>>>> and the total across 8 cpus is 128 * 8 = 1024 bytes. So when the
>>>> number of cpus doubles, the memory consumption quadruples, and on a
>>>> system with a large number of cpus the memory consumption goes up
>>>> quickly. For example, for a 4KB percpu allocation and 128 cpus, the
>>>> total memory consumption is 4KB * 128 * 128 = 64MB. Things become
>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.).
>>>>
>>>> In commit 41a5db8d8161, the non-fix-size percpu memory allocation is
>>>> done at boot time, so for a system with a large number of cpus, the
>>>> initial percpu memory consumption is very visible. For example, for a
>>>> 128-cpu system, the total percpu memory allocation will be at least
>>>> (16 + 32 + 64 + 96 + 128 + 192 + 256 + 512 + 1024 + 2048 + 4096)
>>>> * 128 * 128 = ~138MB,
>>>> which is pretty big. It will be even bigger for a larger number of cpus.
>>> SNIP
>>>> +
>>>>  static void drain_mem_cache(struct bpf_mem_cache *c)
>>>>  {
>>>>  	bool percpu = !!c->percpu_size;
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index f13008d27f35..08f9a49cc11c 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>>>  		if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
>>>>  			return -ENOMEM;
>>>> -		if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>> -			if (!bpf_global_percpu_ma_set) {
>>>> -				mutex_lock(&bpf_percpu_ma_lock);
>>>> -				if (!bpf_global_percpu_ma_set) {
>>>> -					err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
>>>> -					if (!err)
>>>> -						bpf_global_percpu_ma_set = true;
>>>> -				}
>>>> -				mutex_unlock(&bpf_percpu_ma_lock);
>>>> -				if (err)
>>>> -					return err;
>>>> -			}
>>>> -		}
>>>> -
>>>>  		if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
>>>>  			verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
>>>>  			return -EINVAL;
>>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>>>  			return -EINVAL;
>>>>  		}
>>>> +		if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>> +			if (!bpf_global_percpu_ma_set) {
>>>> +				mutex_lock(&bpf_percpu_ma_lock);
>>>> +				if (!bpf_global_percpu_ma_set) {
>>>> +					err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
>>> Because ma->objcg is assigned from get_obj_cgroup_from_current(), I
>>> think the memory accounting will be incorrect, right? Maybe we should
>>> pass the objcg to bpf_mem_alloc_percpu_init() explicitly. For the root
>>> memcg, I think the objcg is NULL.
>>
>> You are correct.
>> Calling bpf_mem_alloc_percpu_init() at the init stage is exactly the
>> reason to have the proper root memcg for the objcg. Sorry, I missed it.
>>
>> I remember I traced it a few days ago and it is indeed NULL.
>> There are three ways to resolve this:
>>   1. Just do 'ma->objcg = NULL' unconditionally in
>>      bpf_mem_alloc_percpu_init().
>>   2. Remember objcg = get_obj_cgroup_from_current() at the init stage,
>>      e.g., in the bpf_global_ma_init() init function (core.c), and use
>>      it later in bpf_mem_alloc_percpu_init().
>>   3. Still do bpf_mem_alloc_percpu_init() at the init stage to
>>      initialize ma->objcg properly, but delay __alloc_percpu_gfp()
>>      until the verifier finds a call to bpf_percpu_obj_new(). We could
>>      add a call bpf_mem_alloc_percpu_init_caches() to do
>>      __alloc_percpu_gfp().
>>
>> I prefer option 3, what do you think?
>
> The option 4 below:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 984c83ecace9..f90989cc9cbc 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -122,6 +122,7 @@ struct bpf_mem_caches {
>  };
>  
>  static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
> +static struct obj_cgroup *objcg_at_init __ro_after_init;
>  
>  static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>  {
> @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
>  	ma->percpu = true;
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -	ma->objcg = get_obj_cgroup_from_current();
> +	ma->objcg = objcg_at_init;
>  #else
>  	ma->objcg = NULL;
>  #endif
> @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
>  
>  	return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init find_objcg_at_init(void)
> +{
> +	objcg_at_init = get_obj_cgroup_from_current();
> +	return 0;
> +}
> +late_initcall(find_objcg_at_init);
>
> It seems this is better?

It seems that you are worried that the objcg of the root memcg may change
from NULL to something else one day, right? If that is the case, I think
both option 3 and option 4 are fine. But I still think passing the desired
objcg to bpf_mem_alloc_percpu_init() directly is better: if the objcg of
the root memcg becomes non-NULL afterwards, we only need to update the
passed argument accordingly.

>
>>
>>>> +					if (!err)
>>>> +						bpf_global_percpu_ma_set = true;
>>>> +				}
>>>> +				mutex_unlock(&bpf_percpu_ma_lock);
>>>> +				if (err)
>>>> +					return err;
>>>> +			}
>>>> +
>>>> +			mutex_lock(&bpf_percpu_ma_lock);
>>>> +			err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
>>>> +			mutex_unlock(&bpf_percpu_ma_lock);
>>>> +			if (err)
>>>> +				return err;
>>>> +		}
>>>> +
>>>>  		struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
>>>>  		if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
>>>>  			if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
>>>
>>
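
To make the suggestion above concrete, here is a rough sketch of the
explicit-objcg variant. The two-argument signature below is hypothetical
and only illustrates the idea; it is not a tested patch:

int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma, struct obj_cgroup *objcg)
{
	/* ... existing cache setup unchanged ... */
	ma->percpu = true;
#ifdef CONFIG_MEMCG_KMEM
	/* The caller decides which objcg the percpu memory is charged to,
	 * instead of the allocator deriving it from the current task.
	 */
	ma->objcg = objcg;
#else
	ma->objcg = NULL;
#endif
	return 0;
}

/* Call site in check_kfunc_call(): the objcg of the root memcg is NULL
 * today, so pass NULL; if that ever changes, only this argument needs
 * to be updated.
 */
err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma, NULL);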