On Tue, Jan 16, 2024 at 12:22:39PM -0800, Dave Hansen wrote:
> On 1/16/24 08:19, Michael Roth wrote:
> >
> > So at the very least, if we went down this path, it would be worth
> > investigating the following areas in addition to general perf testing:
> >
> >   1) Only splitting directmap regions corresponding to kernel-allocatable
> >      *data* (hopefully that's even feasible...)
>
> Take a look at the 64-bit memory map in here:
>
>   https://www.kernel.org/doc/Documentation/x86/x86_64/mm.rst
>
> We already have separate mappings for kernel data and (normal) kernel text.
>
> >   2) Potentially deferring the split until an SNP guest is actually
> >      run, so there isn't any impact just from having SNP enabled (though
> >      you still take a hit from RMP checks in that case so maybe it's not
> >      worthwhile, but that itself has been noted as a concern for users
> >      so it would be nice to not make things even worse).
>
> Yes, this would be nice too.
>
> >> Actually, where _is_ the TLB flushing here?
> > Boris pointed that out in v6, and we implemented it in v7, but it
> > completely cratered performance:
>
> That *desperately* needs to be documented.
>
> How can it be safe to skip the TLB flush?  Is this akin to a page
> permission promotion where you go from RO->RW but can skip the TLB
> flush?  In that case, the CPU will see the RO TLB entry violation, drop
> it, and re-walk the page tables, discovering the RW entry.
>
> Does something similar happen here where the CPU sees the 2M/4k conflict
> in the TLB, drops the 2M entry, does a re-walk then picks up the
> newly-split 2M->4k entries?
>
> I can see how something like that would work, but it's _awfully_ subtle
> to go unmentioned.

I think the current logic relies on the fact that we don't actually have
to remove entries from the RMP table to avoid unexpected RMP #PFs, because
the owners of private PFNs are aware of what access restrictions would be
active, and non-owners of private PFNs shouldn't be trying to write to
them.

It's only the case where a non-owner does a write via a 2MB+ mapping that
happens to overlap a private PFN that needs to be guarded against, and
when a private PFN is removed from the directmap it will end up splitting
the 2MB mapping, and in that case there *is* a TLB flush that would wipe
out the 2MB TLB entry.

After that point, there may still be stale 4KB TLB entries that accrue,
but by the time there is any sensible reason to use them to perform a
write, the PFN would have been transitioned back to shared state and
re-added to the directmap.

But yes, it's ugly, and we've ended up re-working this to simply split
the directmap via set_memory_4k(), which also does the expected TLB
flushing of 2MB+ mappings, and we only call it when absolutely necessary,
benefitting from the following optimizations:

  - If lookup_address() tells us the mapping is already split, no
    splitting work is needed.

  - When the rmpupdate() is for a 2MB private page/RMP entry, there's no
    need to split the directmap, because there's no risk of a non-owner's
    writes overlapping with the private PFNs (RMP checks are only done at
    4KB/2MB granularity, so even a write via a 1GB directmap mapping
    would not trigger an RMP fault, since the non-owner's writes would be
    to a different 2MB PFN range).

Since KVM/gmem generally allocate 2MB backing pages for private guest
memory, the above optimizations result in the directmap only needing to
be split, and the cpa_lock only needing to be acquired, a handful of
times for each guest.
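Just to make the shape of that more concrete, here's a rough sketch of
what the on-demand check ends up looking like. The function name, its
arguments, and the exact flow are made up for illustration and are not
the actual patch; lookup_address(), set_memory_4k(), pfn_to_kaddr(), and
the PG_LEVEL_* constants are the real kernel interfaces being described:

/*
 * Illustrative sketch only, not the actual patch: the name
 * snp_split_directmap_if_needed() and its arguments are hypothetical.
 * Something along these lines would be called from the rmpupdate()
 * path before a PFN is marked private in the RMP table.
 */
#include <linux/mm.h>
#include <asm/set_memory.h>

static int snp_split_directmap_if_needed(u64 pfn, bool make_private, int rmp_level)
{
	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
	unsigned int level;
	pte_t *pte;

	/* Only transitions to private need any directmap handling. */
	if (!make_private)
		return 0;

	/*
	 * A 2MB private RMP entry covers the whole 2MB PFN range, so a
	 * non-owner write through a larger directmap mapping can't
	 * partially overlap it: no split needed.
	 */
	if (rmp_level == PG_LEVEL_2M)
		return 0;

	pte = lookup_address(vaddr, &level);
	if (!pte || !pte_present(*pte))
		return 0;

	/* Already split down to 4K: nothing to do, no cpa_lock taken. */
	if (level == PG_LEVEL_4K)
		return 0;

	/*
	 * set_memory_4k() splits the 2MB/1GB mapping covering this PFN
	 * and flushes the TLB for the large mapping as part of the CPA
	 * work, which is the flush discussed above.
	 */
	return set_memory_4k(vaddr, 1);
}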
This also sort of gives us what we were looking for earlier: a way to
defer the directmap split until it's actually needed. But rather than
happening in bulk, it's amortized over time, and the more it's done, the
more likely it is that the host is being used primarily to run SNP
guests, and so the less likely it is that splitting the directmap is
going to cause some sort of noticeable regression for non-SNP/non-guest
workloads.

I did some basic testing to compare this against pre-splitting the
directmap, and I think it's pretty close performance-wise, so it seems
like a good, safe default. It's also fairly trivial to add a kernel
cmdline param or something later if someone wants to optimize
specifically for SNP guests.

System: EPYC, 512 cores, 1.5TB memory

  == Page-in/acceptance rate for 1 guest with 128 vCPUs/512GB ==

    directmap pre-split to 4K:  13.77 GB/s  11.88 GB/s  15.37 GB/s
    directmap split on-demand:  14.77 GB/s  16.57 GB/s  15.53 GB/s

  == Page-in/acceptance rate for 4 guests, each with 128 vCPUs/256GB ==

    directmap pre-split to 4K:  37.35 GB/s  33.37 GB/s  29.55 GB/s
    directmap split on-demand:  28.88 GB/s  25.13 GB/s  29.28 GB/s

  == Page-in/acceptance rate for 8 guests, each with 128 vCPUs/160GB ==

    directmap pre-split to 4K:  24.39 GB/s  27.16 GB/s  24.70 GB/s
    directmap split on-demand:  26.12 GB/s  24.44 GB/s  22.66 GB/s

So for v2 we're using this on-demand approach, and I've tried to capture
some of this rationale in the commit log / comments for the relevant
patch.

Thanks,

Mike