From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>

Memory capacity has increased dramatically over the last decades, while TLB capacity has stagnated, causing significant virtual address translation overhead. In a collaboration between Carnegie Mellon University and Meta, we investigated the issue in Meta's datacenters and found that about 20% of CPU cycles are spent doing page walks [1]; similar results have been reported by Google [2].

To tackle this overhead, we need widespread use of huge pages. And huge pages, when they can actually be created, work wonders: they provide up to 18% higher performance for Meta's production workloads in our experiments [1]. However, we observed that huge pages through THP are unreliable, because sufficient physical contiguity may not exist and compaction frequently fails to recover from memory fragmentation. To ensure workloads get a reasonable number of huge pages, Meta could not rely on THP and had to use reserved huge pages instead. Proposals to add 1GB THP support [5] are even more dependent on ample physical contiguity.

A major reason for the lack of physical contiguity is the mixing of unmovable and movable allocations, which causes compaction to fail. Quoting from [3]: "in a broad sample of Meta servers, we find that unmovable allocations make up less than 7% of total memory on average, yet occupy 34% of the 2M blocks in the system. We also found that this effect isn't correlated with high uptimes, and that servers can get heavily fragmented within the first hour of running a workload."

Our proposed solution is to confine unmovable allocations to a separate region of physical memory. We experimented with using a CMA region for the movable allocations, but in this version we use ZONE_MOVABLE for movable allocations and all other zones for unmovable ones. Movable allocations can temporarily reside in the unmovable zones, but are proactively moved out by compaction.

To resize ZONE_MOVABLE, we still rely on the memory hotplug interfaces. We export the number of pages scanned on behalf of movable and unmovable allocations during reclaim to approximate the memory pressure in the two parts of physical memory, and a userspace tool can monitor these metrics and make resizing decisions. Previously we augmented the PSI interface to break down memory pressure by movable and unmovable allocation type, but that approach enlarges the scheduler's cacheline footprint. From our preliminary observations, just looking at the per-allocation-type scanned counters, with a little tuning, is sufficient to tell whether there is enough memory for unmovable allocations and to make resizing decisions.

This patch series extends the idea of migratetype isolation at pageblock granularity posted earlier by Johannes Weiner [3] to an as-large-as-needed region, to better support huge pages of bigger sizes and hardware TLB coalescing. We're looking for feedback on the overall direction, particularly in relation to the recent THP allocator optimization proposal [4].
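As a rough illustration of the userspace side, below is a minimal sketch of such a monitoring loop. The counter name "pgscan_unmovable", the SCAN_THRESHOLD value and the grow_unmovable_region() stub are hypothetical placeholders, not the exact interface added by this series:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical tuning knob: pages scanned per interval that we treat
 * as a sign of sustained pressure on the unmovable zones. */
#define SCAN_THRESHOLD	10000UL

/* Placeholder for adjusting the ZONE_MOVABLE boundary through the
 * sysfs/memory hotplug interface described above. */
static void grow_unmovable_region(void)
{
	printf("unmovable pressure detected: would shrink ZONE_MOVABLE\n");
}

/* Read one named counter from /proc/vmstat; returns 0 if not found. */
static unsigned long read_vmstat(const char *name)
{
	char key[64];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return 0;
}

int main(void)
{
	/* "pgscan_unmovable" stands in for the per-allocation-type
	 * scan counter exported by the last patch. */
	unsigned long cur, prev = read_vmstat("pgscan_unmovable");

	for (;;) {
		sleep(10);
		cur = read_vmstat("pgscan_unmovable");
		if (cur - prev > SCAN_THRESHOLD)
			grow_unmovable_region();
		prev = cur;
	}
	return 0;
}

In practice the tool would compare the movable and unmovable scan rates against each other and grow or shrink ZONE_MOVABLE accordingly.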
The patches are based on 6.4 and are also available on GitHub at
https://github.com/magickaiyang/kernel-contiguous/tree/per_alloc_type_reclaim_counters_oct052023

Kaiyang Zhao (7):
  sysfs interface for the boundary of movable zone
  Disallow high-order movable allocations in other zones if ZONE_MOVABLE is populated
  compaction accepts a destination zone
  vmstat counter for pages migrated across zones
  proactively move pages out of unmovable zones in kcompactd
  pass gfp mask of the allocation that woke kswapd to track number of pages scanned on behalf of each alloc type
  export the number of pages scanned on behalf of movable/unmovable allocations

 drivers/base/memory.c         |   2 +-
 drivers/base/node.c           |  32 ++++++
 include/linux/compaction.h    |   4 +-
 include/linux/memory.h        |   1 +
 include/linux/mmzone.h        |   1 +
 include/linux/vm_event_item.h |   6 +
 mm/compaction.c               | 209 ++++++++++++++++++++++++++--------
 mm/internal.h                 |   1 +
 mm/page_alloc.c               |  10 ++
 mm/vmscan.c                   |  28 ++++-
 mm/vmstat.c                   |  14 ++-
 11 files changed, 249 insertions(+), 59 deletions(-)

-- 
2.40.1