Hello,

> On Apr 22, 2021, at 12:45 AM, Laurent Dufour <ldufour@xxxxxxxxxxxxx> wrote:
>
> On 22/04/2021 at 03:33, Dennis Zhou wrote:
>> Hello,
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> The current implementation of the percpu allocator uses the total possible
>>> number of CPUs (nr_cpu_ids) to determine the number of units to allocate
>>> per chunk. Every alloc_percpu() request of N bytes therefore allocates
>>> N*nr_cpu_ids bytes, even if the number of present CPUs is much smaller.
>>> The percpu allocator grows by the number of chunks, keeping the number of
>>> units per chunk constant. It is done this way to simplify CPU
>>> hotplug/remove by having the per-cpu area preallocated.
>>>
>>> Problem: this behavior can lead to inefficient memory usage on big server
>>> machines and VMs, where nr_cpu_ids is huge.
>>>
>>> Example from my experiment:
>>> A 2 vCPU VM with hotplug support (up to 128):
>>> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>> By creating a huge number of active and/or dying memory cgroups, I can
>>> generate active percpu allocations of 100 MB (per single CPU), including
>>> fragmentation overhead. But in that case the total percpu memory
>>> consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, the chunks
>>> are ~75% full in my experiment, so fragmentation is not a concern.
>>> Out of 12.8 GB:
>>> - 0.2 GB are actually used by the present vCPUs, and
>>> - 12.6 GB are "wasted"!
>>>
>>> I've seen production VMs consuming 16-20 GB of memory for percpu. Roman
>>> reported 100 GB. There are ways to reduce the "wasted" memory overhead,
>>> such as: disabling CPU hotplug; reducing the maximum number of CPUs
>>> reported by the hypervisor and/or firmware; or using the possible_cpus=
>>> kernel parameter. But they do not eliminate the fundamental issue of
>>> "wasted" memory.
>>>
>>> Suggestion: support scaling percpu chunks by the number of units in them,
>>> i.e. allocate/deallocate units in existing chunks on CPU hotplug/remove
>>> events.
>>>
>> Idk. In theory it sounds doable.
>> In practice I'm not so sure. The two problems off the top of my head:
>> 1) What happens if we can't allocate new pages when a cpu is onlined?

Simply: onlining a CPU can return an error on allocation failure. Or,
potentially, it can be retried later when memory becomes available, if that is
the case.

>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.

I do not think an online init/exit function is the right approach. There are
many places in Linux where percpu data is initialized right after it is
allocated:

	ptr = alloc_percpu();
	for_each_possible_cpu(cpu) {
		initialize(per_cpu_ptr(ptr, cpu));
	}

Let's keep all such call sites untouched. Hopefully initialize() only touches
the contents of the percpu area without allocating substructures; if it does,
it should be redesigned. BTW, this loop does extra work (runtime overhead) to
initialize areas for possible CPUs which might never arrive.

The proposal:
- If possible_cpus > online_cpus, add an additional unit (call it A) to each
  chunk, containing an initialized image of the percpu data for possible CPUs.
- for_each_possible_cpu(cpu) in the snippet above would then go through all
  online CPUs + 1 (for unit A).
- On arrival of a new CPU #N, percpu allocates the corresponding unit N and
  initializes its contents from unit A. Repeat for all chunks.
- On departure of CPU D, release unit D from the chunks, keeping unit A
  intact.
- With possible_cpus > online_cpus, the overhead is +1 unit (for unit A),
  while the current overhead is +(possible_cpus - online_cpus) units.
- With possible_cpus == online_cpus (no CPU hotplug), do not allocate unit A
  and keep the percpu allocator as it is now: no overhead.

Does this fully cover the 2nd concern?

>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the cpu_possible_mask.
>
> Indeed, it is quite common on PowerPC to set up a VM with a high possible
> number of CPUs but a reasonable number of online CPUs. This allows the user
> to scale up the VM when needed.
>
> For instance, we may see up to 1024 possible CPUs while the online number
> is *only* 128.

Agreed. In VMs, vCPUs are just threads/processes on the host and can easily
be added/removed on demand.

Thanks,
--Alexey