Re: Percpu allocator: CPU hotplug support

Laurent Dufour <ldufour@xxxxxxxxxxxxx> · Thu, 22 Apr 2021 09:45:32 +0200

On 22/04/2021 at 03:33, Dennis Zhou wrote:
Hello,

On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
The current implementation of the percpu allocator uses the total possible number of CPUs (nr_cpu_ids) to
determine the number of units to allocate per chunk. Every alloc_percpu() request of N bytes allocates
N*nr_cpu_ids bytes, even if the number of present CPUs is much lower. The percpu allocator grows by
adding chunks while keeping the number of units per chunk constant. It is done this way to
simplify CPU hotplug/remove by having the per-cpu area preallocated.

Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
where nr_cpu_ids is huge.

Example from my experiment:
2 vCPU VM with hotplug support (up to 128):
[    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
By creating a huge number of active and/or dying memory cgroups, I can generate active percpu
allocations of 100 MB (per single CPU), including fragmentation overhead. But in that case the total
percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
about 75% full in my experiment, so fragmentation is not a concern.
Out of 12.8 GB:
  - 0.2 GB are actually used by present vCPUs, and
  - 12.6 GB are "wasted"!
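
The arithmetic behind these figures can be sketched in a few lines of C. This is only an illustration of the numbers quoted above, assuming every chunk carries one unit per possible CPU; the helper names percpu_total_mb() and percpu_unused_mb() are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): total memory backing percpu
 * allocations when each chunk holds one unit per *possible* CPU.
 */
static uint64_t percpu_total_mb(uint64_t per_cpu_mb, unsigned int nr_possible)
{
	return per_cpu_mb * nr_possible;
}

/*
 * Hypothetical helper: the part of that total that belongs to units of
 * CPUs which are possible but not present.
 */
static uint64_t percpu_unused_mb(uint64_t per_cpu_mb, unsigned int nr_possible,
				 unsigned int nr_present)
{
	assert(nr_present <= nr_possible);
	return per_cpu_mb * (nr_possible - nr_present);
}
```

With the values from the experiment, percpu_total_mb(100, 128) gives 12800 MB (12.8 GB) and percpu_unused_mb(100, 128, 2) gives 12600 MB (12.6 GB), leaving 200 MB (0.2 GB) for the two present vCPUs.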

I've seen production VMs where Percpu consumes 16-20 GB of memory. Roman reported 100 GB.
There are ways to reduce the "wasted" memory overhead, such as disabling CPU hotplug, reducing the
maximum number of CPUs reported by the hypervisor and/or firmware, or using the possible_cpus= kernel
parameter. But none of them eliminates the fundamental issue with "wasted" memory.

Suggestion: support scaling percpu chunks by the number of units they contain, allocating and
deallocating units for existing chunks on CPU hotplug/remove events.

Idk. In theory it sounds doable. In practice I'm not so sure. The two
problems off the top of my head:
1) What happens if we can't allocate new pages when a cpu is onlined?
2) It's possible users set particular conditions in percpu variables
that are not tied to just statistics summing (such as the cpu
runqueues). Users would have to provide online init and exit functions
which could get weird.

As Roman mentioned, I think it would be much better to not have the
large discrepancy between the cpu_online_mask and the cpu_possible_mask.

Indeed, it is quite common on PowerPC to configure a VM with a high number of possible
CPUs but a reasonable number of online CPUs. This allows users to scale
up their VM when needed.

For instance we may see up to 1024 possible CPUs while the online number is 
*only* 128.

Cheers,
Laurent.