Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific


On 11/05/2024 2:04 am, Sean Christopherson wrote:
On Thu, May 09, 2024, Isaku Yamahata wrote:
On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@xxxxxxxxx> wrote:
On 10/05/2024 10:52 am, Sean Christopherson wrote:
On Fri, May 10, 2024, Kai Huang wrote:
On 10/05/2024 4:35 am, Sean Christopherson wrote:
KVM x86 limits KVM_MAX_VCPUS to 4096:

     config KVM_MAX_NR_VCPUS
	int "Maximum number of vCPUs per KVM guest"
	depends on KVM
	range 1024 4096
	default 4096 if MAXSMP
	default 1024
	help

whereas the limitation from TDX is apparently simply due to TD_PARAMS taking
a 16-bit unsigned value:

     #define TDX_MAX_VCPUS  (~(u16)0)

i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
And _if_ it becomes a problem, we don't necessarily need to have a different
_runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
being <= 64k.

Actually, in later versions of the TDX module (starting from 1.5 AFAICT), the
module has a metadata field to report the maximum number of vCPUs that it can
support for all TDX guests.

My quick glance at the 1.5 source shows that the limit is still effectively
0xffff, so again, who cares?  Assert on 0xffff at compile time, assert on the
reported max at runtime, and simply refuse to use a TDX module that has dropped
the minimum below 0xffff.

I need to double check why this metadata field was added.  My concern is that
future module versions may simply lower the value.

TD partitioning would reduce it much.

That's still not a reason to plumb in what is effectively dead code.  Either
partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express
the limitations to userspace, or the TDX-module is potentially breaking existing
use cases.

The 'max_vcpus_per_td' global metadata field is static for the TDX module. If the module supports TD partitioning, it just reports a smaller value regardless of whether we opt in to TD partitioning or not.

I think the point is that 'max_vcpus_per_td' is a TDX architectural thing and the kernel should not make any assumption about its value. The architectural behaviour is:

  If the module has this 'max_vcpus_per_td', software should read and
  use it; Otherwise software should treat it as U16_MAX.

Thus I don't think we will need a new uAPI (TDX specific presumably) just for TD partitioning. And this doesn't break existing use cases.

In fact, this doesn't prevent us from making the KVM_CAP_MAX_VCPUS code generic, e.g., we can do the following:

1) In tdx_vm_init() (called via KVM_VM_CREATE -> vt_vm_init()), we do:

	kvm->max_vcpus = min(kvm->max_vcpus,
				tdx_info->max_vcpus_per_td);

2) In kvm_vm_ioctl_enable_cap_generic(), we add support to handle KVM_CAP_MAX_VCPUS to have the generic code to do:

	if (new_max_vcpus > kvm->max_vcpus)
		return -EINVAL;

	kvm->max_vcpus = new_max_vcpus;

However this means we only allow "lowering" kvm->max_vcpus in kvm_vm_ioctl_enable_cap_generic(KVM_CAP_MAX_VCPUS), but I think this is acceptable?

If it is a concern, alternatively, we can add a new 'kvm->hard_max_vcpus' (or whatever makes sense), and set it in kvm_create_vm() right after kvm_arch_init_vm():

	r = kvm_arch_init_vm(kvm, type);
	if (r)
		goto out_err_no_arch_destroy_vm;

	kvm->hard_max_vcpus = kvm->max_vcpus;

So it always contains "the max_vcpus limited by the ARCH hardware/firmware etc".

And in kvm_vm_ioctl_enable_cap_generic(), we check against kvm->hard_max_vcpus instead, to get rid of the limitation of only allowing lowering of kvm->max_vcpus.

But I don't think this is necessary at this stage.
