Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 14 Jun 2024 17:04:57 -0700

On Fri, Jun 14, 2024, Kai Huang wrote:
> On Tue, 2024-06-04 at 10:48 +0000, Huang, Kai wrote:
> > On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote:
> > > On Thu, May 30, 2024, Kai Huang wrote:
> > > > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
> > > > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being
> > > > > less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
> > > > > that's purely theoretical at this point, i.e. this is all much ado about nothing.
> > > > 
> > > > I am afraid we already have a legitimate case: TD partitioning.  Isaku
> > > > told me the 'max_vcpus_per_td' is lowered to 512 for the modules with TD
> > > > partitioning supported.  And again this is static, i.e., doesn't require
> > > > TD partitioning to be opted into for the limit to drop to 512.
> > > 
> > > So what's Intel's plan for use cases that create TDs with >512 vCPUs?
> > 
> > I checked with TDX module guys.  Turns out the 'max_vcpus_per_td' wasn't
> > introduced because of TD partitioning, and they are not actually related.
> > 
> > They introduced this to support "topology virtualization", which requires
> > a table to record the X2APIC IDs for all vcpus for each TD.  In practice,
> > given a TDX module, the 'max_vcpus_per_td', a.k.a, the X2APIC ID table
> > size reflects the physical logical cpus that *ALL* platforms that the
> > module supports can possibly have.
> > 
> > The reason of this design is TDX guys don't believe there's sense in
> > supporting the case where the 'max_vcpus' for one single TD needs to
> > exceed the physical logical cpus.
> > 
> > So in short:
> > 
> > - The "max_vcpus_per_td" can be different depending on module versions. In
> > practice it reflects the maximum physical logical cpus that all the
> > platforms (that the module supports) can possibly have.
> > 
> > - Before CSPs deploy/migrate TD on a TDX machine, they must be aware of
> > the "max_vcpus_per_td" the module supports, and only deploy/migrate TD to
> > it when it can support.
> > 
> > - For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't
> > correct); For TDX 2.0.xx modules, the value is larger (>1000).  For future
> > module versions, it could have a smaller number, depending on what
> > platforms that module needs to support.  Also, if TDX ever gets supported
> > on client platforms, we can imagine the number could be much smaller due to
> > the "vcpus per td no need to exceed physical logical cpus".
> > 
> > We may ask them to support the case where 'max_vcpus' for single TD
> > exceeds the physical logical cpus, or at least not to lower the value
> > any further for future modules (> 2.0.xx modules).  We may also ask them
> > to promise not to lower the number below some certain value for any
> > future modules.  But I am not sure there's any concrete reason to do so?
> > 
> > What's your thinking?

It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at the
number of online CPUs, although userspace is obviously allowed to create
oversubscribed VMs.

I think the sane thing to do is document that TDX VMs are restricted to the number
of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and
then sanity check that max_vcpus_per_td is greater than or equal to what KVM
reports for KVM_CAP_MAX_VCPUS.

Stating that the maximum number of vCPUs depends on the whims of the TDX module
doesn't provide a predictable ABI for KVM, i.e. I don't want to simply forward
TDX's max_vcpus_per_td to userspace.