Hello,

> On Apr 22, 2021, at 12:45 AM, Laurent Dufour <ldufour@xxxxxxxxxxxxx> wrote:
>
> On 22/04/2021 at 03:33, Dennis Zhou wrote:
>> Hello,
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> The current implementation of the percpu allocator uses the total possible
>>> number of CPUs (nr_cpu_ids) to determine the number of units to allocate
>>> per chunk. Every alloc_percpu() request of N bytes therefore allocates
>>> N*nr_cpu_ids bytes, even if the number of present CPUs is much smaller.
>>> The percpu allocator grows by the number of chunks, keeping the number of
>>> units per chunk constant. It is done this way to simplify CPU
>>> hotplug/remove by having the per-cpu area preallocated.
>>>
>>> Problem: this behavior can lead to inefficient memory usage on big server
>>> machines and VMs, where nr_cpu_ids is huge.
>>>
>>> Example from my experiment:
>>> A 2 vCPU VM with hotplug support (up to 128):
>>> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>> By creating a huge number of active and/or dying memory cgroups, I can
>>> generate active percpu allocations of 100 MB (per single CPU), including
>>> fragmentation overhead. But in that case the total percpu memory
>>> consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, the chunks
>>> are ~75% full in my experiment, so fragmentation is not a concern.
>>> Out of 12.8 GB:
>>> - 0.2 GB are actually used by the present vCPUs, and
>>> - 12.6 GB are "wasted"!
>>>
>>> I've seen production VMs consuming 16-20 GB of memory for percpu. Roman
>>> reported 100 GB. There are ways to reduce the "wasted" memory overhead,
>>> such as: disabling CPU hotplug; reducing the maximum number of CPUs
>>> reported by the hypervisor and/or firmware; or using the possible_cpus=
>>> kernel parameter. But they do not eliminate the fundamental issue of
>>> "wasted" memory.
>>>
>>> Suggestion: support scaling percpu chunks by the number of units in them,
>>> i.e. allocate/deallocate units in existing chunks on CPU hotplug/remove
>>> events.
>>>
>> Idk. In theory it sounds doable.
>> In practice I'm not so sure. The two problems off the top of my head:
>> 1) What happens if we can't allocate new pages when a cpu is onlined?

Simply: onlining a CPU can return an error on allocation failure. Or,
potentially, it can be retried later when memory becomes available, if that is
the case.

>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.

I do not think an online init/exit function is the right approach. There are
many places in Linux where percpu data is initialized right after it is
allocated:

	ptr = alloc_percpu();
	for_each_possible_cpu(cpu) {
		initialize(per_cpu_ptr(ptr, cpu));
	}

Let's keep all such call sites untouched. Hopefully initialize() only touches
the contents of the percpu area without allocating substructures; if it does,
it should be redesigned. BTW, this loop does extra work (runtime overhead) to
initialize areas for possible CPUs which might never arrive.

The proposal:
- If possible_cpus > online_cpus, add an additional unit (call it A) to each
  chunk, containing an initialized image of the percpu data for possible CPUs.
- for_each_possible_cpu(cpu) in the snippet above would then go through all
  online CPUs + 1 (for unit A).
- On arrival of a new CPU #N, percpu allocates the corresponding unit N and
  initializes its contents from unit A. Repeat for all chunks.
- On departure of CPU D, release unit D from the chunks, keeping unit A
  intact.
- With possible_cpus > online_cpus, the overhead is +1 unit (for unit A),
  while the current overhead is +(possible_cpus - online_cpus) units.
- With possible_cpus == online_cpus (no CPU hotplug), do not allocate unit A
  and keep the percpu allocator as it is now: no overhead.

Does this fully cover the 2nd concern?

>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the cpu_possible_mask.
>
> Indeed, it is quite common on PowerPC to set up a VM with a high possible
> number of CPUs but a reasonable number of online CPUs. This allows the user
> to scale up the VM when needed.
>
> For instance, we may see up to 1024 possible CPUs while the online number
> is *only* 128.

Agreed. In VMs, vCPUs are just threads/processes on the host and can easily
be added/removed on demand.

Thanks,
--Alexey