On Tue, Jan 16, 2024 at 12:22:39PM -0800, Dave Hansen wrote:
> On 1/16/24 08:19, Michael Roth wrote:
> >
> > So at the very least, if we went down this path, it would be worth
> > investigating the following areas in addition to general perf testing:
> >
> >   1) Only splitting directmap regions corresponding to kernel-allocatable
> >      *data* (hopefully that's even feasible...)
>
> Take a look at the 64-bit memory map in here:
>
>   https://www.kernel.org/doc/Documentation/x86/x86_64/mm.rst
>
> We already have separate mappings for kernel data and (normal) kernel text.
>
> >   2) Potentially deferring the split until an SNP guest is actually
> >      run, so there isn't any impact just from having SNP enabled (though
> >      you still take a hit from RMP checks in that case so maybe it's not
> >      worthwhile, but that itself has been noted as a concern for users
> >      so it would be nice to not make things even worse).
>
> Yes, this would be nice too.
>
> >> Actually, where _is_ the TLB flushing here?
> > Boris pointed that out in v6, and we implemented it in v7, but it
> > completely cratered performance:
>
> That *desperately* needs to be documented.
>
> How can it be safe to skip the TLB flush?  Is this akin to a page
> permission promotion where you go from RO->RW but can skip the TLB
> flush?  In that case, the CPU will see the RO TLB entry violation, drop
> it, and re-walk the page tables, discovering the RW entry.
>
> Does something similar happen here where the CPU sees the 2M/4k conflict
> in the TLB, drops the 2M entry, does a re-walk then picks up the
> newly-split 2M->4k entries?
>
> I can see how something like that would work, but it's _awfully_ subtle
> to go unmentioned.

I think the current logic relies on the fact that we don't actually have
to remove entries from the RMP table to avoid unexpected RMP #PFs, because
the owners of private PFNs are aware of what access restrictions would be
active, and non-owners of private PFNs shouldn't be trying to write to
them.

It's only the case where a non-owner does a write via a 2MB+ mapping that
happens to overlap a private PFN that needs to be guarded against, and
when a private PFN is removed from the directmap it will end up splitting
the 2MB mapping, and in that case there *is* a TLB flush that would wipe
out the 2MB TLB entry.

After that point, there may still be stale 4KB TLB entries that accrue,
but by the time there is any sensible reason to use them to perform a
write, the PFN would have been transitioned back to shared state and
re-added to the directmap.

But yes, it's ugly, and we've ended up re-working this to simply split
the directmap via set_memory_4k(), which also does the expected TLB
flushing of 2MB+ mappings, and we only call it when absolutely necessary,
benefitting from the following optimizations:

  - If lookup_address() tells us the mapping is already split, no
    splitting work is needed.

  - When the rmpupdate() is for a 2MB private page/RMP entry, there's no
    need to split the directmap, because there's no risk of a non-owner's
    writes overlapping with the private PFNs (RMP checks are only done at
    4KB/2MB granularity, so even a write via a 1GB directmap mapping
    would not trigger an RMP fault, since the non-owner's writes would be
    to a different 2MB PFN range).

Since KVM/gmem generally allocate 2MB backing pages for private guest
memory, the above optimizations result in the directmap only needing to
be split, and the cpa_lock only needing to be acquired, a handful of
times for each guest.
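Just to make the shape of that more concrete, here's a rough sketch of
what the on-demand check ends up looking like. The function name, its
arguments, and the exact flow are made up for illustration and are not
the actual patch; lookup_address(), set_memory_4k(), pfn_to_kaddr(), and
the PG_LEVEL_* constants are the real kernel interfaces being described:

/*
 * Illustrative sketch only, not the actual patch: the name
 * snp_split_directmap_if_needed() and its arguments are hypothetical.
 * Something along these lines would be called from the rmpupdate()
 * path before a PFN is marked private in the RMP table.
 */
#include <linux/mm.h>
#include <asm/set_memory.h>

static int snp_split_directmap_if_needed(u64 pfn, bool make_private, int rmp_level)
{
	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
	unsigned int level;
	pte_t *pte;

	/* Only transitions to private need any directmap handling. */
	if (!make_private)
		return 0;

	/*
	 * A 2MB private RMP entry covers the whole 2MB PFN range, so a
	 * non-owner write through a larger directmap mapping can't
	 * partially overlap it: no split needed.
	 */
	if (rmp_level == PG_LEVEL_2M)
		return 0;

	pte = lookup_address(vaddr, &level);
	if (!pte || !pte_present(*pte))
		return 0;

	/* Already split down to 4K: nothing to do, no cpa_lock taken. */
	if (level == PG_LEVEL_4K)
		return 0;

	/*
	 * set_memory_4k() splits the 2MB/1GB mapping covering this PFN
	 * and flushes the TLB for the large mapping as part of the CPA
	 * work, which is the flush discussed above.
	 */
	return set_memory_4k(vaddr, 1);
}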
This also sort of gives us what we were looking for earlier: a way to
defer the directmap split until it's actually needed. But rather than
happening in bulk, it's amortized over time, and the more it's done, the
more likely it is that the host is being used primarily to run SNP
guests, and so the less likely it is that splitting the directmap is
going to cause some sort of noticeable regression for non-SNP/non-guest
workloads.

I did some basic testing to compare this against pre-splitting the
directmap, and I think it's pretty close performance-wise, so it seems
like a good, safe default. It's also fairly trivial to add a kernel
cmdline param or something later if someone wants to optimize
specifically for SNP guests.

System: EPYC, 512 cores, 1.5TB memory

  == Page-in/acceptance rate for 1 guest with 128 vCPUs/512GB ==

    directmap pre-split to 4K:  13.77 GB/s  11.88 GB/s  15.37 GB/s
    directmap split on-demand:  14.77 GB/s  16.57 GB/s  15.53 GB/s

  == Page-in/acceptance rate for 4 guests, each with 128 vCPUs/256GB ==

    directmap pre-split to 4K:  37.35 GB/s  33.37 GB/s  29.55 GB/s
    directmap split on-demand:  28.88 GB/s  25.13 GB/s  29.28 GB/s

  == Page-in/acceptance rate for 8 guests, each with 128 vCPUs/160GB ==

    directmap pre-split to 4K:  24.39 GB/s  27.16 GB/s  24.70 GB/s
    directmap split on-demand:  26.12 GB/s  24.44 GB/s  22.66 GB/s

So for v2 we're using this on-demand approach, and I've tried to capture
some of this rationale in the commit log / comments for the relevant
patch.

Thanks,

Mike