Hi,

We are seeing increased slab memory consumption on a PowerPC guest LPAR
(on PowerVM) with an uncommon topology, where one NUMA node has no CPUs
or memory and the other node has all the CPUs and memory. Though QEMU
prevents such topologies for KVM guests, I hacked QEMU to allow this
topology in order to get some slab numbers. Here is a comparison of such
a KVM guest with a single-node KVM guest that has the same number of CPUs
and the same amount of memory.

Case 1: 2 node NUMA, node0 empty
================================
# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16294 MB
node 1 free: 15453 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

Case 2: Single node
===================
# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16294 MB
node 0 free: 15675 MB
node distances:
node   0
  0:  10

Here is how the total slab memory consumption compares right after boot:

# grep -i slab /proc/meminfo
Case 1: 442560 kB
Case 2: 195904 kB

A closer look at the individual slabs suggests that most of the increased
slab consumption in Case 1 can be attributed to the kmalloc-N slabs. In
particular, the following two caches account for most of the increase.

Case 1:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512
kmalloc-1k     2869  1024  101.5M  1549/1540/0  32 0  99   2 U
kmalloc-512    3302   512  100.2M  1530/1522/0  64 0  99   1 U

Case 2:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512
kmalloc-1k     2811  1024    6.1M      94/29/0  32 0  30  46 U
kmalloc-512    3207   512    3.5M      54/13/0  64 0  24  46 U

Here is the list of slub stats that differ significantly between the two
cases:

Case 1:
------
alloc_from_partial 6333 C0=1506 C1=525 C2=774 C3=478 C4=413 C5=1036 C6=698 C7=903
alloc_slab 3350 C0=757 C1=336 C2=120 C3=72 C4=120 C5=912 C6=600 C7=433
alloc_slowpath 9792 C0=2329 C1=861 C2=916 C3=571 C4=533 C5=1948 C6=1298 C7=1336
cmpxchg_double_fail 31 C1=3 C2=2 C3=7 C4=3 C5=4 C6=2 C7=10
deactivate_full 38 C0=14 C1=2 C2=13 C5=3 C6=2 C7=4
deactivate_remote_frees 1 C7=1
deactivate_to_head 10092 C0=2654 C1=859 C2=903 C3=571 C4=533 C5=1945 C6=1296 C7=1331
deactivate_to_tail 1 C7=1
free_add_partial 29 C0=7 C2=1 C3=5 C4=3 C5=6 C6=2 C7=5
free_frozen 32 C0=4 C1=3 C2=4 C3=3 C4=7 C5=3 C6=7 C7=1
free_remove_partial 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slab 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slowpath 7415 C0=2014 C1=486 C2=433 C3=525 C4=814 C5=1707 C6=586 C7=850
objects 2875 N1=2875
objects_partial 2587 N1=2587
partial 1542 N1=1542
slabs 1551 N1=1551
total_objects 49632 N1=49632

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=147813/152837/153714 pid=1-1074 cpus=0-2,5-7 nodes=1

# cat free_calls (truncated)
   2671 <not-available> age=4295094831 pid=0 cpus=0 nodes=1
      2 free_fair_sched_group+0xa0/0x120 age=156576/156850/157125 pid=0 cpus=0,5 nodes=1

Case 2:
------
alloc_from_partial 9231 C0=435 C1=2349 C2=2386 C3=1807 C4=882 C5=367 C6=559 C7=446
alloc_slab 114 C0=12 C1=41 C2=28 C3=15 C4=9 C5=1 C6=1 C7=7
alloc_slowpath 9415 C0=448 C1=2390 C2=2414 C3=1891 C4=891 C5=368 C6=560 C7=453
cmpxchg_double_fail 22 C0=1 C1=1 C3=3 C4=8 C5=1 C6=5 C7=3
deactivate_full 512 C0=13 C1=143 C2=147 C3=147 C4=22 C5=10 C6=6 C7=24
deactivate_remote_frees 1 C4=1
deactivate_to_head 9099 C0=437 C1=2247 C2=2267 C3=1937 C4=870 C5=358 C6=554 C7=429
deactivate_to_tail 1 C4=1
free_add_partial 447 C0=21 C1=140 C2=164 C3=60 C4=22 C5=16 C6=14 C7=10
free_frozen 22 C0=3 C2=3 C3=2 C4=1 C5=6 C6=6 C7=1
free_remove_partial 20 C1=5 C2=5 C4=3 C6=7
free_slab 20 C1=5 C2=5 C4=3 C6=7
free_slowpath 6953 C0=194 C1=2123 C2=1729 C3=850 C4=466 C5=725 C6=520 C7=346
objects 2812 N0=2812
objects_partial 733 N0=733
partial 29 N0=29
slabs 94 N0=94
total_objects 3008 N0=3008

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=43957/46225/46802 pid=1-1059 cpus=1-5,7

# cat free_calls (truncated)
   1516 <not-available> age=4294987281 pid=0 cpus=0
    647 free_fair_sched_group+0xa0/0x120 age=48798/49142/49628 pid=0-954 cpus=1-2

We see a significant difference in the number of partial slabs and the
resulting total_objects between the two cases. I was trying to see whether
this has anything to do with the way the node value is arrived at in
different slub routines. I haven't yet understood the slub code well enough
to say anything conclusively, but the following hack in the slub code
eliminates the increased slab consumption for Case 1 and makes it very
similar to Case 2:

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..888e4d245444 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
 	int searchnode = node;
 
-	if (node == NUMA_NO_NODE)
+	if (node == NUMA_NO_NODE || !node_present_pages(node))
 		searchnode = numa_mem_id();
-	else if (!node_present_pages(node))
-		searchnode = node_to_mem_node(node);
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)

Regards,
Bharata.