Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

"Huang, Kai" <kai.huang@xxxxxxxxx> · Thu, 30 May 2024 12:21:58 +0000

On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
> On Tue, May 14, 2024, Kai Huang wrote:
> > 
> > 
> > On 11/05/2024 2:04 am, Sean Christopherson wrote:
> > > On Thu, May 09, 2024, Isaku Yamahata wrote:
> > > > On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@xxxxxxxxx> wrote:
> > > > > On 10/05/2024 10:52 am, Sean Christopherson wrote:
> > > > > > On Fri, May 10, 2024, Kai Huang wrote:
> > > > > > > On 10/05/2024 4:35 am, Sean Christopherson wrote:
> > > > > > > > KVM x86 limits KVM_MAX_VCPUS to 4096:
> > > > > > > > 
> > > > > > > >      config KVM_MAX_NR_VCPUS
> > > > > > > > 	int "Maximum number of vCPUs per KVM guest"
> > > > > > > > 	depends on KVM
> > > > > > > > 	range 1024 4096
> > > > > > > > 	default 4096 if MAXSMP
> > > > > > > > 	default 1024
> > > > > > > > 	help
> > > > > > > > 
> > > > > > > > > whereas the limitation from TDX is apparently simply due to TD_PARAMS taking
> > > > > > > > a 16-bit unsigned value:
> > > > > > > > 
> > > > > > > >      #define TDX_MAX_VCPUS  (~(u16)0)
> > > > > > > > 
> > > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> > > > > > > > And _if_ it becomes a problem, we don't necessarily need to have a different
> > > > > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> > > > > > > > being <= 64k.
> > > > > > > 
> > > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module
> > > > > > > has a metadata field to report the maximum vCPUs that the module can support
> > > > > > > for all TDX guests.
> > > > > > 
> > > > > > My quick glance at the 1.5 source shows that the limit is still effectively
> > > > > > 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
> > > > > > max at runtime and simply refuse to use a TDX module that has dropped the minimum
> > > > > > below 0xffff.
> > > > > 
> > > > > > I need to double check why this metadata field was added.  My concern is that
> > > > > > future module versions may just lower the value.
> > > > 
> > > > TD partitioning would reduce it much.
> > > 
> > > That's still not a reason to plumb in what is effectively dead code.  Either
> > > > partitioning is opt-in, at which point I suspect KVM will need yet more uAPI to express
> > > the limitations to userspace, or the TDX-module is potentially breaking existing
> > > use cases.
> > 
> > The 'max_vcpus_per_td' global metadata field is static for the TDX module.
> > If the module supports TD partitioning, it just reports some smaller
> > value regardless of whether we opt in to TD partitioning or not.
> > 
> > I think the point is this 'max_vcpus_per_td' is TDX architectural thing and
> > kernel should not make any assumption of the value of it.
> 
> It's not an assumption, it's a requirement.  And KVM already places requirements
> on "hardware", e.g. kvm-intel.ko will refuse to load if the CPU doesn't support
> a bare minimum VMX feature set.  Refusing to enable TDX because max_vcpus_per_td
> is unexpectedly low isn't fundamentally different than refusing to enable VMX
> because IRQ window exiting is unsupported.

OK.  I have no argument against this.

But I am not sure why we need to have such a requirement.  See below.

> 
> In the unlikely event there is a legitimate reason for max_vcpus_per_td being
> less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
> that's purely theoretical at this point, i.e. this is all much ado about nothing.

I am afraid we already have a legitimate case: TD partitioning.  Isaku
told me 'max_vcpus_per_td' is lowered to 512 for modules that support TD
partitioning.  And again this is static, i.e., the value drops to 512
whether or not TD partitioning is opted in.

So AFAICT this isn't a theoretical thing now.
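
To make the consequence concrete, here is a rough sketch of the "treat it as
a requirement" check under discussion.  All names are hypothetical, and I am
assuming "KVM's minimum" means the lowest value the quoted Kconfig range
allows (1024); this is not actual KVM code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: the floor KVM would require is the low end of the quoted
 * "range 1024 4096" for KVM_MAX_NR_VCPUS.  Hypothetical name.
 */
#define SKETCH_KVM_REQUIRED_VCPUS 1024u

/*
 * Hypothetical helper: returns true if a TDX module reporting
 * 'max_vcpus_per_td' would pass the requirement, false if KVM would
 * refuse to enable TDX on it.
 */
static bool sketch_tdx_module_usable(uint16_t max_vcpus_per_td)
{
	return max_vcpus_per_td >= SKETCH_KVM_REQUIRED_VCPUS;
}
```

With such a check, a module reporting 0xffff passes, but a module reporting
512 would be rejected outright, even for guests that only need a few vCPUs.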

Also, I want to say I was wrong about "MAX_VCPUS" in the TD_PARAMS being
part of attestation.  It is not.  TDREPORT doesn't include "MAX_VCPUS", and
it is not involved in calculating the measurement of the guest.

Given "MAX_VCPUS" is not part of attestation, I think there's no need to
allow userspace to change kvm->max_vcpus via the KVM_ENABLE_CAP ioctl() for
KVM_CAP_MAX_VCPUS.

So we could just adjust kvm->max_vcpus once and for all in tdx_vm_init()
for a TDX guest:

	kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td);

AFAICT no other change is needed.

And in KVM_TDX_VM_INIT (where TDH.MNG.INIT is done) we can just use
kvm->max_vcpus to fill the "MAX_VCPUS" in TD_PARAMS.
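
Putting the two steps together, a minimal userspace sketch of the idea might
look like the below.  The struct layouts and the sketch_* names are
stand-ins I made up for illustration; they are not the real kernel or TDX
module definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors the upper bound of the quoted Kconfig range. */
#define SKETCH_KVM_MAX_VCPUS 4096u

/* Hypothetical, heavily trimmed stand-ins for the real structures. */
struct sketch_kvm       { uint32_t max_vcpus; };
struct sketch_tdx_info  { uint16_t max_vcpus_per_td; };
struct sketch_td_params { uint16_t max_vcpus; };

/* The 16-bit TD_PARAMS field must be able to hold KVM's build-time max. */
static_assert(SKETCH_KVM_MAX_VCPUS <= UINT16_MAX,
	      "KVM max vCPUs must fit in the 16-bit MAX_VCPUS field");

/* Step 1 (VM creation): clamp the per-VM limit to what the module reports. */
static void sketch_tdx_vm_init(struct sketch_kvm *kvm,
			       const struct sketch_tdx_info *info)
{
	if (kvm->max_vcpus > info->max_vcpus_per_td)
		kvm->max_vcpus = info->max_vcpus_per_td;
}

/* Step 2 (KVM_TDX_VM_INIT): reuse the clamped value for TD_PARAMS. */
static void sketch_fill_td_params(const struct sketch_kvm *kvm,
				  struct sketch_td_params *params)
{
	/* Safe to narrow: step 1 bounded max_vcpus by a 16-bit value. */
	params->max_vcpus = (uint16_t)kvm->max_vcpus;
}
```

The point of the sketch is that userspace never has to pass a vCPU limit for
TDX: the value reported to the TDX module always follows kvm->max_vcpus,
which is already bounded by the module's own metadata.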