On 19 Mar 2024, at 22:42, kaiyang2@xxxxxxxxxx wrote:

> From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>
>
> Memory capacity has increased dramatically over the last decades.
> Meanwhile, TLB capacity has stagnated, causing a significant virtual
> address translation overhead. As a collaboration between Carnegie Mellon
> University and Meta, we investigated the issue at Meta’s datacenters and
> found that about 20% of CPU cycles are spent doing page walks [1], and
> similar results are also reported by Google [2].
>
> To tackle the overhead, we need widespread uses of huge pages. And huge
> pages, when they can actually be created, work wonders: they provide up
> to 18% higher performance for Meta’s production workloads in our
> experiments [1].
>
> However, we observed that huge pages through THP are unreliable because
> sufficient physical contiguity may not exist and compaction to recover
> from memory fragmentation frequently fails. To ensure workloads get a
> reasonable number of huge pages, Meta could not rely on THP and had to
> use reserved huge pages. Proposals to add 1GB THP support [5] are even
> more dependent on ample availability of physical contiguity.
>
> A major reason for the lack of physical contiguity is the mixing of
> unmovable and movable allocations, causing compaction to fail. Quoting
> from [3], “in a broad sample of Meta servers, we find that unmovable
> allocations make up less than 7% of total memory on average, yet occupy
> 34% of the 2M blocks in the system. We also found that this effect isn't
> correlated with high uptimes, and that servers can get heavily
> fragmented within the first hour of running a workload.”
>
> Our proposed solution is to confine the unmovable allocations to a
> separate region in physical memory. We experimented with using a CMA
> region for the movable allocations, but in this version we use
> ZONE_MOVABLE for movable and all other zones for unmovable allocations.
> Movable allocations can temporarily reside in the unmovable zones, but
> will be proactively moved out by compaction.
>
> To resize ZONE_MOVABLE, we still rely on memory hotplug interfaces. We
> export the number of pages scanned on behalf of movable or unmovable
> allocations during reclaim to approximate the memory pressure in two
> parts of physical memory, and a userspace tool can monitor the metrics
> and make resizing decisions. Previously we augmented the PSI interface
> to break down memory pressure into movable and unmovable allocation
> types, but that approach enlarges the scheduler cacheline footprint.
> From our preliminary observations, just looking at the per-allocation
> type scanned counters and with a little tuning, it is sufficient to tell
> if there is not enough memory for unmovable allocations and make
> resizing decisions.
>
> This patch extends the idea of migratetype isolation at pageblock
> granularity posted earlier [3] by Johannes Weiner to an
> as-large-as-needed region to better support huge pages of bigger sizes
> and hardware TLB coalescing. We’re looking for feedback on the overall
> direction, particularly in relation to the recent THP allocator
> optimization proposal [4].
>
> The patches are based on 6.4 and are also available on github at
> https://github.com/magickaiyang/kernel-contiguous/tree/per_alloc_type_reclaim_counters_oct052023

Your reference links (1 to 4) are missing.

--
Best Regards,
Yan, Zi