Re: [PATCH v1 11/26] x86/sev: Invalidate pages from the direct map when adding them to the RMP table

Michael Roth <michael.roth@xxxxxxx> · Tue, 16 Jan 2024 10:50:25 -0600

On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> I did some performance tests which do seem to indicate that
> pre-splitting the directmap to 4K can be substantially improve certain
> SNP guest workloads. This test involves running a single 1TB SNP guest
> with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
> rapidly fault in all of its memory via lazy acceptance, and then
> measuring the rate that gmem pages are being allocated on the host by
> monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
> of how quickly a guest can fault in it's initial working set prior to
> reaching steady state. The data is a bit noisy but seems to indicate
> significant improvement by taking the directmap updates out of the
> lazy acceptance path, and I would only expect that to become more
> significant as you scale up the number of guests / vCPUs.
> 
>   # Average fault-in rate across 3 runs, measured in GB/s
>                     unpinned | pinned to NUMA node 0
>   DirectMap4K           12.9 | 12.1
>              stddev      2.2 |  1.3
>   DirectMap2M+split      8.0 |  8.9
>              stddev      1.3 |  0.8
> 
> The downside of course is potential impact for non-SNP workloads
> resulting from splitting the directmap. Mike Rapoport's numbers make
> me feel a little better about it, but I don't think they apply directly
> to the notion of splitting the entire directmap. It's Even he LWN article
> summarizes:
> 
>   "The conclusion from all of this, Rapoport continued, was that
>   direct-map fragmentation just does not matter — for data access, at
>   least. Using huge-page mappings does still appear to make a difference
>   for memory containing the kernel code, so allocator changes should
>   focus on code allocations — improving the layout of allocations for
>   loadable modules, for example, or allowing vmalloc() to allocate huge
>   pages for code. But, for kernel-data allocations, direct-map
>   fragmentation simply appears to not be worth worrying about."
> 
> So at the very least, if we went down this path, we would be worth
> investigating the following areas in addition to general perf testing:
> 
>   1) Only splitting directmap regions corresponding to kernel-allocatable
>      *data* (hopefully that's even feasible...)
>   2) Potentially deferring the split until an SNP guest is actually
>      run, so there isn't any impact just from having SNP enabled (though
>      you still take a hit from RMP checks in that case so maybe it's not
>      worthwhile, but that itself has been noted as a concern for users
>      so it would be nice to not make things even worse).

There's another potential area of investigation I forgot to mention that
doesn't involve pre-splitting the directmap. It makes use of the fact
that the kernel should never be accessing a 2MB mapping that overlaps with
private guest memory if the backing PFN for the guest memory is a 2MB page.
Since there's no chance for overlap (well, maybe via a 1GB directmap entry,
but not as dramatic a change to force those to 2M), there's no need to
actually split the directmap entry in these cases since they won't
result in unexpected RMP faults.

So if pre-splitting the directmap ends up having too many downsides, then
there may still some potential for optimizing the current approach to a
fair degree.

-Mike