Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 4 Jun 2024 10:48:59 +0000

On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote:
> On Thu, May 30, 2024, Kai Huang wrote:
> > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
> > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being
> > > less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
> > > that's purely theoretical at this point, i.e. this is all much ado about nothing.
> > 
> > I am afraid we already have a legitimate case: TD partitioning.  Isaku
> > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD
> > partitioning supported.  And again this is static, i.e., doesn't require
> > TD partitioning to be opt-in to low to 512.
> 
> So what's Intel's plan for use cases that creates TDs with >512 vCPUs?

I checked with the TDX module guys.  It turns out 'max_vcpus_per_td'
wasn't introduced because of TD partitioning; the two are not actually
related.

They introduced it to support "topology virtualization", which requires
a table recording the X2APIC IDs of all vcpus of each TD.  In practice,
for a given TDX module, 'max_vcpus_per_td' (a.k.a. the X2APIC ID table
size) reflects the maximum number of physical logical cpus across *ALL*
platforms that the module supports.

The reasoning behind this design is that the TDX guys don't see any sense
in supporting the case where 'max_vcpus' for a single TD needs to exceed
the number of physical logical cpus.

So in short:

- The "max_vcpus_per_td" can differ between module versions.  In
practice it reflects the maximum number of physical logical cpus that all
the platforms (that the module supports) can possibly have.

- Before CSPs deploy/migrate a TD on a TDX machine, they must be aware of
the "max_vcpus_per_td" the module supports, and only deploy/migrate the TD
to that machine when the module can support the TD's vcpu count.

- For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't
correct); for TDX 2.0.xx modules, the value is larger (>1000).  Future
module versions could have a smaller number, depending on which
platforms the module needs to support.  Also, if TDX ever gets supported
on client platforms, we can imagine the number could be much smaller,
following the "vcpus per TD need not exceed physical logical cpus"
reasoning.
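
To make the points above concrete, here is a rough sketch of the check a
deployment would effectively have to make.  It is illustrative only, not
code from the patch: every name is hypothetical, the 1024 KVM limit is an
assumed example value, and only the 576 figure comes from the numbers
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: generic KVM vCPU limit; the real value depends on arch/config. */
#define EXAMPLE_KVM_MAX_VCPUS			1024u

/* 'max_vcpus_per_td' reported for TDX 1.5.xx modules (number from this thread). */
#define EXAMPLE_TDX_1_5_MAX_VCPUS_PER_TD	576u

/*
 * Effective per-TD vCPU limit: the smaller of KVM's own limit and the
 * limit reported by the TDX module.
 */
static inline uint32_t example_td_max_vcpus(uint32_t kvm_max,
					    uint32_t module_max)
{
	/* Both limits are expected to be non-zero in this sketch. */
	assert(kvm_max > 0 && module_max > 0);

	return module_max < kvm_max ? module_max : kvm_max;
}

/*
 * Would a TD with 'nr_vcpus' vCPUs fit on a host whose module reports
 * 'module_max'?  This is the check implied before deploy/migrate.
 */
static inline bool example_td_fits(uint32_t nr_vcpus, uint32_t kvm_max,
				   uint32_t module_max)
{
	return nr_vcpus > 0 &&
	       nr_vcpus <= example_td_max_vcpus(kvm_max, module_max);
}
```

With these example values, a 512-vcpu TD fits on a 1.5.xx module while a
600-vcpu TD does not, even though 600 is below the assumed KVM limit.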

We may ask them to support the case where 'max_vcpus' for a single TD
exceeds the physical logical cpus, or at least not to lower the value
any further for future modules (> 2.0.xx modules).  We may also ask them
to promise not to lower the number below some certain value for any
future modules.  But I am not sure there's any concrete reason to do so?

What's your thinking?