On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> I did some performance tests which do seem to indicate that
> pre-splitting the directmap to 4K can substantially improve certain
> SNP guest workloads. This test involves running a single 1TB SNP guest
> with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
> rapidly fault in all of its memory via lazy acceptance, and then
> measuring the rate that gmem pages are being allocated on the host by
> monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
> of how quickly a guest can fault in its initial working set prior to
> reaching steady state. The data is a bit noisy but seems to indicate
> significant improvement by taking the directmap updates out of the
> lazy acceptance path, and I would only expect that to become more
> significant as you scale up the number of guests / vCPUs.
>
>   # Average fault-in rate across 3 runs, measured in GB/s
>                         unpinned | pinned to NUMA node 0
>   DirectMap4K               12.9 | 12.1
>     stddev                   2.2 |  1.3
>   DirectMap2M+split          8.0 |  8.9
>     stddev                   1.3 |  0.8
>
> The downside of course is potential impact for non-SNP workloads
> resulting from splitting the directmap. Mike Rapoport's numbers make
> me feel a little better about it, but I don't think they apply directly
> to the notion of splitting the entire directmap. Even the LWN article
> summarizes:
>
>   "The conclusion from all of this, Rapoport continued, was that
>   direct-map fragmentation just does not matter — for data access, at
>   least. Using huge-page mappings does still appear to make a difference
>   for memory containing the kernel code, so allocator changes should
>   focus on code allocations — improving the layout of allocations for
>   loadable modules, for example, or allowing vmalloc() to allocate huge
>   pages for code. But, for kernel-data allocations, direct-map
>   fragmentation simply appears to not be worth worrying about."
>
> So at the very least, if we went down this path, it would be worth
> investigating the following areas in addition to general perf testing:
>
>   1) Only splitting directmap regions corresponding to kernel-allocatable
>      *data* (hopefully that's even feasible...)
>   2) Potentially deferring the split until an SNP guest is actually
>      run, so there isn't any impact just from having SNP enabled (though
>      you still take a hit from RMP checks in that case so maybe it's not
>      worthwhile, but that itself has been noted as a concern for users
>      so it would be nice to not make things even worse).

There's another potential area of investigation I forgot to mention that
doesn't involve pre-splitting the directmap. It makes use of the fact
that the kernel should never be accessing a 2MB mapping that overlaps
with private guest memory if the backing PFN for the guest memory is a
2MB page. Since there's no chance for overlap (well, maybe via a 1GB
directmap entry, but forcing those to 2M is not as dramatic a change),
there's no need to actually split the directmap entry in these cases,
since they won't result in unexpected RMP faults.

So if pre-splitting the directmap ends up having too many downsides,
then there may still be some potential for optimizing the current
approach to a fair degree.

-Mike
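
To make the 2MB case above a bit more concrete, here is a rough sketch in
C of the decision being described. It is only an illustration under my
own assumptions, not the actual kernel code: the names (map_level,
level_pages, directmap_needs_split) are hypothetical, and it only models
the size comparison between the directmap entry and the guest page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these are real kernel symbols. */
#define PTRS_PER_LEVEL 512ULL  /* x86-64: 512 entries per page-table level */

enum map_level { MAP_4K = 0, MAP_2M = 1, MAP_1G = 2 };

/* Number of 4K pages covered by a single mapping at the given level. */
static uint64_t level_pages(enum map_level level)
{
	uint64_t n = 1;

	for (int i = 0; i < (int)level; i++)
		n *= PTRS_PER_LEVEL;
	return n;
}

/*
 * Would the directmap entry covering 'pfn' need to be split before a
 * guest page of size 'guest_level' starting at 'pfn' becomes private?
 *
 * Assumption: a split is only needed when the directmap entry is larger
 * than the guest page, because only then does the entry also cover
 * memory that is not part of the private page. A 2M directmap entry
 * backing a 2M guest page coincides with it exactly, so it can stay.
 */
static bool directmap_needs_split(uint64_t pfn, enum map_level guest_level,
				  enum map_level dmap_level)
{
	/* The guest page is assumed to be naturally aligned. */
	assert((pfn & (level_pages(guest_level) - 1)) == 0);

	return dmap_level > guest_level;
}
```

With this model, a 2M guest page under a 2M directmap entry needs no
split, a 4K guest page under a 2M entry does, and a 2M guest page under
a 1G entry only needs the entry taken down to 2M rather than to 4K.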