Re: [PATCH v1 11/26] x86/sev: Invalidate pages from the direct map when adding them to the RMP table

Michael Roth <michael.roth@xxxxxxx> · Tue, 16 Jan 2024 10:19:09 -0600

On Fri, Jan 12, 2024 at 12:37:31PM -0800, Dave Hansen wrote:
> On 1/12/24 12:28, Tom Lendacky wrote:
> > I thought there was also a desire to remove the direct map for any pages
> > assigned to a guest as private, not just the case that the comment says.
> > So updating the comment would probably the best action.
> 
> I'm not sure who desires that.
> 
> It's sloooooooow to remove things from the direct map.  There's almost
> certainly a frequency cutoff where running the whole direct mapping as
> 4k is better than the cost of mapping/unmapping.

One area we'd been looking to optimize[1] is lots of vCPUs/guests
faulting in private memory on-demand via kernels with lazy acceptance
support. The lazy acceptance / #NPF path for each 4K/2M guest page
involves updating the corresponding RMP table entry to set it to
Guest-owned state, and as part of that we remove the PFNs from the
directmap. There is indeed potential for scalability issues due to how
directmap updates are currently handled, since they involve holding a
global cpa_lock for every update.

At the time, there were investigations on how to remove the cpa_lock,
and there's an RFC patch[2] that implements this change, but I don't
know where this stands today.

There was also some work[3] on being able to restore 2MB entries in the
directmap (currently entries are only re-added as 4K) the Mike Rapoport
pointed out to me a while back. With that, since the bulk of private
guest pages 2MB, we'd be able to avoid splitting the directmap to 4K over
time.

There was also some indication[4] that UPM/guest_memfd would eventually
manage directmap invalidations internally, so it made sense to have
SNP continue to handle this up until the point that mangement was
moved to gmem.

Those 3 things, paired with a platform-independent way of catching
unexpected kernel accesses to private guest memory, are what I think
nudged us all toward the current implementation.

But AFAIK none of these 3 things are being actively upstreamed today,
so it makes sense to re-consider how we handle the directmap in the
interim.

I did some performance tests which do seem to indicate that
pre-splitting the directmap to 4K can be substantially improve certain
SNP guest workloads. This test involves running a single 1TB SNP guest
with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
rapidly fault in all of its memory via lazy acceptance, and then
measuring the rate that gmem pages are being allocated on the host by
monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
of how quickly a guest can fault in it's initial working set prior to
reaching steady state. The data is a bit noisy but seems to indicate
significant improvement by taking the directmap updates out of the
lazy acceptance path, and I would only expect that to become more
significant as you scale up the number of guests / vCPUs.

  # Average fault-in rate across 3 runs, measured in GB/s
                    unpinned | pinned to NUMA node 0
  DirectMap4K           12.9 | 12.1
             stddev      2.2 |  1.3
  DirectMap2M+split      8.0 |  8.9
             stddev      1.3 |  0.8

The downside of course is potential impact for non-SNP workloads
resulting from splitting the directmap. Mike Rapoport's numbers make
me feel a little better about it, but I don't think they apply directly
to the notion of splitting the entire directmap. It's Even he LWN article
summarizes:

  "The conclusion from all of this, Rapoport continued, was that
  direct-map fragmentation just does not matter — for data access, at
  least. Using huge-page mappings does still appear to make a difference
  for memory containing the kernel code, so allocator changes should
  focus on code allocations — improving the layout of allocations for
  loadable modules, for example, or allowing vmalloc() to allocate huge
  pages for code. But, for kernel-data allocations, direct-map
  fragmentation simply appears to not be worth worrying about."

So at the very least, if we went down this path, we would be worth
investigating the following areas in addition to general perf testing:

  1) Only splitting directmap regions corresponding to kernel-allocatable
     *data* (hopefully that's even feasible...)
  2) Potentially deferring the split until an SNP guest is actually
     run, so there isn't any impact just from having SNP enabled (though
     you still take a hit from RMP checks in that case so maybe it's not
     worthwhile, but that itself has been noted as a concern for users
     so it would be nice to not make things even worse).

[1] https://lore.kernel.org/linux-mm/20231103000105.m3z4eijcxlxciyzd@xxxxxxx/
[2] https://lore.kernel.org/lkml/Y7f9ZuPcIMk37KnN@xxxxxxxxx/T/#m15b74841f5319c0d1177f118470e9714d4ea96c8
[3] https://lore.kernel.org/linux-kernel/20200416213229.19174-1-kirill.shutemov@xxxxxxxxxxxxxxx/
[4] https://lore.kernel.org/all/YyGLXXkFCmxBfu5U@xxxxxxxxxx/

> 
> Actually, where _is_ the TLB flushing here?

Boris pointed that out in v6, and we implemented it in v7, but it
completely cratered performance:

   https://lore.kernel.org/linux-mm/20221219150026.bltiyk72pmdc2ic3@xxxxxxx/

After further discussion I think we'd concluded it wasn't necessary. Maybe
that's worth revisiting though. If it is necessary, then that would be
another reason to just pre-split the directmap because the above-mentioned
lazy acceptance workload/bottleneck would likely get substantially worse.

-Mike