[Cc Johannes, Shakeel and Nico]

On Thu 09-12-21 00:36:53, Alexey Makhalov wrote:
> Hello,
>
> I use Vmware VM with the following configuration.
> - 1 vCPU per 1 NUMA node
> - 4 online vCPUs, 128 possible vCPUs. It translates to
> - 4 online nodes and 128 possible nodes.
> - 192VM memory

We have discussed this particular setup in another email thread that is
not related to this particular issue, but let me just repeat that I
consider such a configuration rather surprising and suboptimal. I am not
sure what the actual topology will be, but a single CPU per NUMA node
will have some interesting side effects (e.g. on CPU load balancing).
Also, having too many memoryless and CPU-less nodes is not something
many kernel subsystems are optimized for. At best we try to avoid
scaling with MAX_NUMNODES and scale with the number of possible nodes
instead. Proper handling of possible nodes without memory requires
memory hotplug notifiers and synchronization.

> Linux 5.15 with CONFIG_NODES_SHIFT=6 complains on node numbers more
> that maximum supported (1 << 6):
> Nov 27 01:59:37 photon-576f8974caf.org kernel: SRAT: PXM 62 -> APIC 0x7c -> Node 62
> Nov 27 01:59:37 photon-576f8974caf.org kernel: SRAT: PXM 63 -> APIC 0x7e -> Node 63
> Nov 27 01:59:37 photon-576f8974caf.org kernel: SRAT: Too many proximity domains 40
> Nov 27 01:59:37 photon-576f8974caf.org kernel: ACPI: SRAT: SRAT not used.
> Nov 27 01:59:37 photon-576f8974caf.org kernel: No NUMA configuration found
>
> But it boots OK and Percpu memory amount is 53760 kB
>
> If I compile with CONFIG_NODES_SHIFT=10 to support 128 nodes, boot warning disappears,
> cpu info reports proper numa nodes for existing cpus.
> But boot process fails with OOM in pid 1.
>
> Increasing VM RAM from 192 MB to 1024MB fixed OOM.
> /proc/meminfo reported increase in Percpu to 718048 kB !!
>
> It sounds surprising as number of CPUs are the same in both cases.
>
> Initial analysis showed that each memory cgroup allocates per node structures. Each of
> them have percpu allocations, doing 128 * 128 * struct size.
> See: mem_cgroup_alloc() -> alloc_mem_cgroup_per_node_info()
>
> There is also old comment about it in alloc_mem_cgroup_per_node_info()
> /*
>  * This routine is called against possible nodes.
>  * But it's BUG to call kmalloc() against offline node.
>  *
>  * TODO: this routine can waste much memory for nodes which will
>  *       never be onlined. It's better to use memory hotplug callback
>  *       function.
>  */
> There are might be other places not efficiently using memory for non existing nodes.

Yes, another example would be shrinkers: see
http://lkml.kernel.org/r/aa8a8deb-0fdb-9408-48d4-adadb5602d72@xxxxxxxxxx
-- 
Michal Hocko
SUSE Labs
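
For reference, the "128 * 128 * struct size" estimate quoted above can be
put into numbers with a small back-of-the-envelope sketch in Python. Only
the 128 possible CPUs and 128 possible nodes come from the report. The
64-byte structure size and the helper name percpu_footprint_bytes() are
hypothetical placeholders chosen for illustration, not values taken from
the kernel, and the single-node comparison assumes the kernel falls back
to one node when the SRAT is not used.

```python
def percpu_footprint_bytes(possible_cpus, possible_nodes, struct_size, cgroups=1):
    """Estimate percpu memory consumed by per-node memcg structures.

    Model: every memory cgroup allocates one per-node structure for each
    possible node, and each of those carries a percpu area that is
    replicated for every possible CPU.
    """
    return cgroups * possible_nodes * possible_cpus * struct_size


# From the report: 128 possible CPUs, 128 possible nodes.
# ASSUMPTION: 64 bytes is a made-up per-CPU structure size.
STRUCT_SIZE = 64

many_nodes = percpu_footprint_bytes(128, 128, STRUCT_SIZE)
# ASSUMPTION: with the SRAT rejected, only a single node is possible.
single_node = percpu_footprint_bytes(128, 1, STRUCT_SIZE)

print(f"128 possible nodes: {many_nodes // 1024} kB per cgroup")
print(f"  1 possible node:  {single_node // 1024} kB per cgroup")
print(f"ratio: {many_nodes // single_node}x")
```

Whatever the real structure size is, the ratio between the two cases is
the number of possible nodes, which is why the cost grows with possible
rather than online nodes.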