On Fri, Jan 12, 2024 at 12:37:31PM -0800, Dave Hansen wrote: > On 1/12/24 12:28, Tom Lendacky wrote: > > I thought there was also a desire to remove the direct map for any pages > > assigned to a guest as private, not just the case that the comment says. > > So updating the comment would probably the best action. > > I'm not sure who desires that. > > It's sloooooooow to remove things from the direct map. There's almost > certainly a frequency cutoff where running the whole direct mapping as > 4k is better than the cost of mapping/unmapping. One area we'd been looking to optimize[1] is lots of vCPUs/guests faulting in private memory on-demand via kernels with lazy acceptance support. The lazy acceptance / #NPF path for each 4K/2M guest page involves updating the corresponding RMP table entry to set it to Guest-owned state, and as part of that we remove the PFNs from the directmap. There is indeed potential for scalability issues due to how directmap updates are currently handled, since they involve holding a global cpa_lock for every update. At the time, there were investigations on how to remove the cpa_lock, and there's an RFC patch[2] that implements this change, but I don't know where this stands today. There was also some work[3] on being able to restore 2MB entries in the directmap (currently entries are only re-added as 4K) the Mike Rapoport pointed out to me a while back. With that, since the bulk of private guest pages 2MB, we'd be able to avoid splitting the directmap to 4K over time. There was also some indication[4] that UPM/guest_memfd would eventually manage directmap invalidations internally, so it made sense to have SNP continue to handle this up until the point that mangement was moved to gmem. Those 3 things, paired with a platform-independent way of catching unexpected kernel accesses to private guest memory, are what I think nudged us all toward the current implementation. But AFAIK none of these 3 things are being actively upstreamed today, so it makes sense to re-consider how we handle the directmap in the interim. I did some performance tests which do seem to indicate that pre-splitting the directmap to 4K can be substantially improve certain SNP guest workloads. This test involves running a single 1TB SNP guest with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to rapidly fault in all of its memory via lazy acceptance, and then measuring the rate that gmem pages are being allocated on the host by monitoring "FileHugePages" from /proc/meminfo to get some rough gauge of how quickly a guest can fault in it's initial working set prior to reaching steady state. The data is a bit noisy but seems to indicate significant improvement by taking the directmap updates out of the lazy acceptance path, and I would only expect that to become more significant as you scale up the number of guests / vCPUs. # Average fault-in rate across 3 runs, measured in GB/s unpinned | pinned to NUMA node 0 DirectMap4K 12.9 | 12.1 stddev 2.2 | 1.3 DirectMap2M+split 8.0 | 8.9 stddev 1.3 | 0.8 The downside of course is potential impact for non-SNP workloads resulting from splitting the directmap. Mike Rapoport's numbers make me feel a little better about it, but I don't think they apply directly to the notion of splitting the entire directmap. It's Even he LWN article summarizes: "The conclusion from all of this, Rapoport continued, was that direct-map fragmentation just does not matter — for data access, at least. Using huge-page mappings does still appear to make a difference for memory containing the kernel code, so allocator changes should focus on code allocations — improving the layout of allocations for loadable modules, for example, or allowing vmalloc() to allocate huge pages for code. But, for kernel-data allocations, direct-map fragmentation simply appears to not be worth worrying about." So at the very least, if we went down this path, we would be worth investigating the following areas in addition to general perf testing: 1) Only splitting directmap regions corresponding to kernel-allocatable *data* (hopefully that's even feasible...) 2) Potentially deferring the split until an SNP guest is actually run, so there isn't any impact just from having SNP enabled (though you still take a hit from RMP checks in that case so maybe it's not worthwhile, but that itself has been noted as a concern for users so it would be nice to not make things even worse). [1] https://lore.kernel.org/linux-mm/20231103000105.m3z4eijcxlxciyzd@xxxxxxx/ [2] https://lore.kernel.org/lkml/Y7f9ZuPcIMk37KnN@xxxxxxxxx/T/#m15b74841f5319c0d1177f118470e9714d4ea96c8 [3] https://lore.kernel.org/linux-kernel/20200416213229.19174-1-kirill.shutemov@xxxxxxxxxxxxxxx/ [4] https://lore.kernel.org/all/YyGLXXkFCmxBfu5U@xxxxxxxxxx/ > > Actually, where _is_ the TLB flushing here? Boris pointed that out in v6, and we implemented it in v7, but it completely cratered performance: https://lore.kernel.org/linux-mm/20221219150026.bltiyk72pmdc2ic3@xxxxxxx/ After further discussion I think we'd concluded it wasn't necessary. Maybe that's worth revisiting though. If it is necessary, then that would be another reason to just pre-split the directmap because the above-mentioned lazy acceptance workload/bottleneck would likely get substantially worse. -Mike