Re: [RFC Patch 1/2] KVM: SVM: Create SEV cgroup controller.

Sean Christopherson <sean.j.christopherson@xxxxxxxxx> · Tue, 3 Nov 2020 10:10:09 -0800

On Tue, Nov 03, 2020 at 08:39:12AM -0800, James Bottomley wrote:
> On Mon, 2020-09-21 at 18:22 -0700, Sean Christopherson wrote:
> > ASIDs too.  I'd also love to see more info in the docs and/or cover
> > letter to explain why ASID management on SEV requires a cgroup.  I
> > know what an ASID is, and have a decent idea of how KVM manages ASIDs
> > for legacy VMs, but I know nothing about why ASIDs are limited for
> > SEV and not legacy VMs.
> 
> Well, also, why would we only have a cgroup for ASIDs but not MSIDs?

Assuming MSID==PCID in Intel terminology, which may be a bad assumption, the
answer is that rationing PCIDs is a fool's errand, at least on Intel CPUs.

> For the reader at home a Space ID (SID) is simply a tag that can be
> placed on a cache line to control things like flushing.  Intel and AMD
> use MSIDs which are allocated per process to allow fast context
> switching by flushing all the process pages using a flush by SID. 
> ASIDs are also used by both Intel and AMD to control nested/extended
> paging of virtual machines, so ASIDs are allocated per VM.  So far it's
> universal.

On Intel CPUs, multiple things factor into the actual ASID that is used to tag
TLB entries.  And under the hood, there is a _very_ limited number of
ASIDs that are globally shared, i.e. a process in the host has an ASID, same
as a process in the guest, and the CPU only supports tagging translations for
N ASIDs at any given time.

E.g. with TDX, the hardware/real ASID is derived from:

   VPID + PCID + SEAM + EPTP

where VPID=0 for host, PCID=0 if PCID is disabled, SEAM=1 for the TDX-Module
and TDX VMs, and obviously EPTP is invalid/ignored when EPT is disabled.
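
To make the normalization rules concrete, here is a toy model of the inputs.
This is only a sketch: the struct, the field widths and the function names are
made up for illustration, "invalid/ignored" is modeled as zero, and it says
nothing about how the hardware actually combines the inputs into a tag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the inputs that feed the hardware ASID. */
struct tag_inputs {
	uint16_t vpid;	/* 0 for the host */
	uint16_t pcid;	/* 0 if PCID is disabled */
	bool seam;	/* true for the TDX-Module and TDX VMs */
	uint64_t eptp;	/* modeled as 0 when EPT is disabled */
};

/* Apply the normalization rules described above. */
struct tag_inputs make_tag_inputs(bool is_guest, uint16_t vpid,
				  bool pcid_enabled, uint16_t pcid,
				  bool seam, bool ept_enabled, uint64_t eptp)
{
	struct tag_inputs t = {
		.vpid = is_guest ? vpid : 0,
		.pcid = pcid_enabled ? (uint16_t)(pcid & 0xFFF) : 0,
		.seam = seam,
		.eptp = ept_enabled ? eptp : 0,
	};
	return t;
}

/* Two contexts can only share a tag if every input matches. */
bool same_tag(const struct tag_inputs *a, const struct tag_inputs *b)
{
	return a->vpid == b->vpid && a->pcid == b->pcid &&
	       a->seam == b->seam && a->eptp == b->eptp;
}
```

E.g. a host process and a guest process that both use PCID 3 still end up with
different inputs, because the guest has a non-zero VPID and a valid EPTP.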

> AMD invented a mechanism for tying their memory encryption technology
> to the ASID asserted on the memory bus, so now they can do encrypted
> virtual machines since each VM is tagged by ASID which the memory
> encryptor sees.  It is suspected that the forthcoming intel TDX
> technology to encrypt VMs will operate in the same way as well.  This

TDX uses MKTME keys, which are not tied to the ASID.  The KeyID is part of the
physical address, at least in the initial hardware implementations, which means
that from a memory perspective, each KeyID is a unique physical address.  This
is completely orthogonal to ASIDs, e.g. a given KeyID+PA combo can have
multiple TLB entries if it's accessed by multiple ASIDs.
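
A rough sketch of "the KeyID is part of the physical address" is below.  The
layout struct and helper names are hypothetical, and the bit widths used in the
example are placeholders; the real number of KeyID bits and their position are
platform-specific.

```c
#include <stdint.h>

/* Hypothetical layout: KeyID lives in the top keyid_bits of the PA. */
struct keyid_layout {
	unsigned int pa_bits;		/* total physical address width */
	unsigned int keyid_bits;	/* upper bits repurposed for the KeyID */
};

static inline unsigned int keyid_shift(const struct keyid_layout *l)
{
	return l->pa_bits - l->keyid_bits;
}

/* Fold a KeyID into the upper bits of a physical address. */
static inline uint64_t pa_with_keyid(const struct keyid_layout *l,
				     uint64_t pa, uint16_t keyid)
{
	unsigned int shift = keyid_shift(l);
	uint64_t keyid_mask = (1ULL << l->keyid_bits) - 1;
	uint64_t pa_mask = (1ULL << shift) - 1;

	return (pa & pa_mask) | (((uint64_t)keyid & keyid_mask) << shift);
}

/* Recover the KeyID from a tagged physical address. */
static inline uint16_t pa_keyid(const struct keyid_layout *l, uint64_t pa)
{
	uint64_t keyid_mask = (1ULL << l->keyid_bits) - 1;

	return (uint16_t)((pa >> keyid_shift(l)) & keyid_mask);
}
```

With an assumed layout of { .pa_bits = 46, .keyid_bits = 6 }, the same PA with
KeyID 1 and with KeyID 2 yields two distinct values, which is the sense in
which each KeyID is a unique physical address from a memory perspective.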

> isn't everything you have to do to get an encrypted VM, but it's a core
> part of it.
> 
> The problem with SIDs (both A and M) is that they get crammed into
> spare bits in the CPU (like the upper bits of %CR3 for MSID) so we

This CR3 reference is why I assume MSID==PCID, but the PCID is carved out of
the lower bits (11:0) of CR3, which is why I'm unsure I interpreted this
correctly.
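
For reference, a minimal sketch of where the PCID sits in CR3 when CR4.PCIDE=1.
The helper names are made up for this example and are not meant to match any
existing kernel helper.

```c
#include <stdint.h>

/* With CR4.PCIDE=1, CR3[11:0] holds the PCID (12 bits => 4096 values). */
#define CR3_PCID_MASK	0xFFFULL

/* Extract the PCID from a CR3 value. */
static inline uint16_t cr3_pcid(uint64_t cr3)
{
	return (uint16_t)(cr3 & CR3_PCID_MASK);
}

/* Combine a page-aligned page table root with a PCID. */
static inline uint64_t cr3_with_pcid(uint64_t pgd_pa, uint16_t pcid)
{
	return (pgd_pa & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
}
```

I.e. the PCID occupies the low 12 bits, not the upper bits.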

> don't have enough of them to do a 1:1 mapping of MSID to process or
> ASID to VM.  Thus we have to ration them somewhat, which is what I
> assume this patch is about?

This cgroup is more about a hard limitation than about performance.

With PCIDs, VPIDs, and AMD's ASIDs, there is always the option of recycling an
existing ID (used for PCIDs and ASIDs), or simply disabling the feature (used
for VPIDs).  In both cases, exhausting the resource affects performance due to
incurring TLB flushes at transition points, but doesn't prevent creating new
processes/VMs.

And due to the way PCID=>ASID derivation works on Intel CPUs, the kernel
doesn't even bother trying to use a large number of PCIDs.  IIRC, the current
number of PCIDs used by the kernel is 5, i.e. the kernel intentionally
recycles PCIDs long before it's forced to do so by the architectural
limitation of 4k PCIDs, because using more than 5 PCIDs actually hurts
performance (forced PCID recycling allows the kernel to keep *its* ASID live
by flushing userspace PCIDs, whereas CPU recycling of ASIDs is indiscriminate).

MKTME KeyIDs and SEV ASIDs are different.  There is a hard, relatively low
limit on the number of IDs that are available, and exhausting that pool
effectively prevents creating a new encrypted VM[*].  E.g. with TDX, on first
gen hardware there is a hard limit of 127 KeyIDs that can be used to create
TDX VMs.  IIRC, SEV-ES is capped at 512 or so ASIDs.  Hitting that cap means
no more protected VMs can be created.

[*] KeyID exhaustion for TDX is a hard restriction: the old VM _must_ be torn
    down to reuse the KeyID.  ASID exhaustion for SEV is not technically a
    hard limit, e.g. KVM could theoretically park a VM to reuse its ASID, but
    for all intents and purposes that VM is no longer live.
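
To illustrate the difference between the two classes of IDs, here is a toy
allocator.  It is a sketch only: the pool size, the round-robin victim
selection and all names are invented for the example, and it is not how KVM or
the kernel actually manage PCIDs, ASIDs or KeyIDs.

```c
#include <stdbool.h>

#define POOL_SIZE 4	/* tiny on purpose, real pools are larger */

struct id_pool {
	bool in_use[POOL_SIZE];
	unsigned int next_victim;	/* round-robin recycling cursor */
	unsigned int flushes;		/* TLB flushes caused by recycling */
	bool recyclable;		/* true: PCID-like, false: KeyID-like */
};

/* Returns an ID, or -1 if the pool is exhausted and IDs can't be recycled. */
int id_alloc(struct id_pool *p)
{
	int i;

	for (i = 0; i < POOL_SIZE; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}

	/* Hard limit: the caller cannot create a new VM. */
	if (!p->recyclable)
		return -1;

	/* Soft limit: steal an ID and pay for it with a flush. */
	i = (int)p->next_victim;
	p->next_victim = (p->next_victim + 1) % POOL_SIZE;
	p->flushes++;
	return i;
}

/* Release an ID, e.g. when the owning VM is torn down. */
void id_free(struct id_pool *p, int id)
{
	if (id >= 0 && id < POOL_SIZE)
		p->in_use[id] = false;
}
```

With recyclable=true the fifth allocation succeeds and only costs a flush.  With
recyclable=false the fifth allocation fails until some owner calls id_free(),
and deciding who gets to hold those scarce IDs is what the cgroup is for.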