Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

"Huang, Kai" <kai.huang@xxxxxxxxx> · Fri, 10 May 2024 10:40:56 +1200

On 10/05/2024 4:35 am, Sean Christopherson wrote:
On Mon, Feb 26, 2024, isaku.yamahata@xxxxxxxxx wrote:
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>

TDX has its own limitation on the maximum number of vcpus that the guest
can accommodate.  Allow x86 kvm backend to implement its own KVM_ENABLE_CAP
handler and implement TDX backend for KVM_CAP_MAX_VCPUS.  A user space VMM,
e.g. QEMU, can then specify its own value instead of KVM_MAX_VCPUS.

When creating a TD (TDH.MNG.INIT), the maximum number of vCPUs needs to be
specified in struct td_params_struct, and the value is part of the
measurement.  The user space has to specify the value somehow.  There are
two options for it.
option 1. API (Set KVM_CAP_MAX_VCPUS) to specify the value (this patch)

When I suggested adding a capability[*], the intent was for the capability to
be generic, not buried in TDX code.  I can't think of any reason why this can't
be supported for all VMs on all architectures.  The only wrinkle is that it'll
require a separate capability since userspace needs to be able to detect that
KVM supports restricting the number of vCPUs, but that'll still be _less_ code.

[*] https://lore.kernel.org/all/YZVsnZ8e7cXls2P2@xxxxxxxxxx

+static int vt_max_vcpus(struct kvm *kvm)
+{
+	if (!kvm)
+		return KVM_MAX_VCPUS;
+
+	if (is_td(kvm))
+		return min(kvm->max_vcpus, TDX_MAX_VCPUS);
+
+	return kvm->max_vcpus;

This is _completely_ orthogonal to allowing userspace to restrict the maximum
number of vCPUs.  And unless I'm missing something, it's also ridiculous and
unnecessary at this time.

Right, it's not necessary.  I think it can be reported as:

        case KVM_CAP_MAX_VCPUS:
                r = KVM_MAX_VCPUS;
+               if (kvm)
+                       r = kvm->max_vcpus;
                break;


KVM x86 limits KVM_MAX_VCPUS to 4096:

   config KVM_MAX_NR_VCPUS
	int "Maximum number of vCPUs per KVM guest"
	depends on KVM
	range 1024 4096
	default 4096 if MAXSMP
	default 1024
	help

whereas the limitation from TDX is apparently simply due to TD_PARAMS taking
a 16-bit unsigned value:

   #define TDX_MAX_VCPUS  (~(u16)0)

i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
And _if_ it becomes a problem, we don't necessarily need to have a different
_runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
being <= 64k.
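
One C detail is worth checking in the TDX_MAX_VCPUS definition quoted above: with the cast applied before the complement, the u16 operand is promoted to int first, so the expression evaluates to -1 rather than 65535. The sketch below is a userspace illustration only, assuming a platform where int is wider than 16 bits; uint16_t stands in for the kernel's u16, and the SKETCH_* and sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The sketch only holds where int is wider than 16 bits (integer promotion). */
static_assert(sizeof(int) > sizeof(uint16_t), "int must be wider than uint16_t");

/* Same shape as the quoted macro, with uint16_t standing in for u16. */
#define SKETCH_NAIVE_MAX	(~(uint16_t)0)

/* Cast after the complement so the result is truncated back to 16 bits. */
#define SKETCH_CAST_MAX		((uint16_t)~0)

/* Hypothetical helpers returning each macro's value widened to long. */
static long sketch_naive_max(void)
{
	return SKETCH_NAIVE_MAX;	/* (uint16_t)0 is promoted to int, so ~0 is -1 */
}

static long sketch_cast_max(void)
{
	return SKETCH_CAST_MAX;		/* -1 converted to uint16_t is 65535 */
}
```

Using U16_MAX directly, as the updated patch does for the fallback value, avoids the question entirely.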

Actually, in later versions of the TDX module (starting from 1.5 AFAICT), the
module has a metadata field to report the maximum vCPUs that the module
can support for all TDX guests.

So rather than add a bunch of pointless plumbing, just throw in

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 137d08da43c3..018d5b9eb93d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2488,6 +2488,9 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
                 return -EOPNOTSUPP;
         }
  
+       BUILD_BUG_ON(CONFIG_KVM_MAX_NR_VCPUS <
+                    sizeof(td_params->max_vcpus) * BITS_PER_BYTE);
+
         td_params->max_vcpus = kvm->max_vcpus;
         td_params->attributes = init_vm->attributes;
         /* td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD; */

Yeah the above could be helpful, but might not be necessary.
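
For reference, the BUILD_BUG_ON in the quoted diff compares the vCPU limit against a bit count (16 for a 16-bit field). If the intent is to catch a configured limit that no longer fits in the field, a check along the following lines might be closer. This is a userspace sketch under stated assumptions, not the in-tree code: SKETCH_KVM_MAX_NR_VCPUS stands in for CONFIG_KVM_MAX_NR_VCPUS, struct sketch_td_params is a reduced stand-in for struct td_params, and C11 static_assert stands in for BUILD_BUG_ON.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: stands in for CONFIG_KVM_MAX_NR_VCPUS (4096 is the Kconfig upper bound). */
#define SKETCH_KVM_MAX_NR_VCPUS 4096

/* Hypothetical, reduced stand-in for struct td_params: only the field of interest. */
struct sketch_td_params {
	uint16_t max_vcpus;
};

/* Largest value the max_vcpus field can hold, derived from the field's size. */
#define SKETCH_FIELD_MAX \
	((1ULL << (sizeof(((struct sketch_td_params *)0)->max_vcpus) * 8)) - 1)

/* Fail the build if the configured vCPU limit cannot be represented in the field. */
static_assert(SKETCH_KVM_MAX_NR_VCPUS <= SKETCH_FIELD_MAX,
	      "vCPU limit does not fit in td_params.max_vcpus");

/* Copy a per-VM limit into the parameter block, rejecting values that do not fit. */
static int sketch_fill_max_vcpus(struct sketch_td_params *p, int max_vcpus)
{
	if (max_vcpus <= 0 || (unsigned long long)max_vcpus > SKETCH_FIELD_MAX)
		return -EINVAL;

	p->max_vcpus = (uint16_t)max_vcpus;
	return 0;
}
```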

So the logic of the updated patch is:

1) At module load time, we read the maximum number of vCPUs that the TDX
module can support:

	/*
	 * TDX module may not support MD_FIELD_ID_MAX_VCPUS_PER_TD
	 * depending on its version.
	 */
	tdx_info->max_vcpus_per_td = U16_MAX;
	if (!tdx_sys_metadata_field_read(MD_FIELD_ID_MAX_VCPUS_PER_TD,
					&tmp))
	        tdx_info->max_vcpus_per_td = (u16)tmp;
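
The fallback in step 1 can be modelled in userspace roughly as below. This is only a sketch: the reader callback type and the two stand-in readers are hypothetical (the value 576 is arbitrary), and clamping to UINT16_MAX is a variation I am showing for illustration, whereas the patch uses a plain (u16) cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reader type: returns 0 and fills *val on success, nonzero
 * when the module does not support the metadata field.
 */
typedef int (*sketch_field_reader)(uint64_t *val);

static uint16_t sketch_max_vcpus_per_td(sketch_field_reader read_field)
{
	uint64_t tmp;

	assert(read_field != NULL);

	/* Older modules may not expose the field: fall back to the 16-bit ceiling. */
	if (read_field(&tmp))
		return UINT16_MAX;

	/* Variation on the patch: clamp rather than truncate an oversized value. */
	return tmp > UINT16_MAX ? UINT16_MAX : (uint16_t)tmp;
}

/* Stand-in reader for a module without the field. */
static int sketch_reader_unsupported(uint64_t *val)
{
	(void)val;
	return -1;
}

/* Stand-in reader for a module reporting an arbitrary limit of 576. */
static int sketch_reader_576(uint64_t *val)
{
	*val = 576;
	return 0;
}
```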

2) When a TDX guest is created, userspace needs to call
IOCTL(KVM_ENABLE_CAP) to configure the maximum vCPUs of the guest.  A
new kvm_x86_ops::vm_enable_cap() is added because TDX has its own
limitation (metadata field) as mentioned above.

@@ -6827,6 +6829,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
        }
        default:
                r = -EINVAL;
+               if (kvm_x86_ops.vm_enable_cap)
+                       r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
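
The dispatch pattern in this hunk (common code handles what it knows, then gives an optional backend hook a chance, with -EINVAL as the default) can be sketched in plain C as follows. All sketch_* types and SKETCH_* constants are hypothetical stand-ins; an ordinary function pointer replaces static_call().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced, hypothetical stand-ins for struct kvm and struct kvm_enable_cap. */
struct sketch_vm {
	int is_td;
};

struct sketch_cap {
	unsigned int cap;
};

/* Reduced stand-in for kvm_x86_ops: just the optional hook. */
struct sketch_ops {
	int (*vm_enable_cap)(struct sketch_vm *vm, struct sketch_cap *cap);
};

#define SKETCH_CAP_COMMON	1	/* handled by common code */
#define SKETCH_CAP_BACKEND	2	/* only a backend may accept it */

static int sketch_enable_cap(const struct sketch_ops *ops,
			     struct sketch_vm *vm, struct sketch_cap *cap)
{
	int r;

	assert(ops && vm && cap);

	switch (cap->cap) {
	case SKETCH_CAP_COMMON:
		r = 0;
		break;
	default:
		/* Unknown to common code: reject unless a backend hook accepts it. */
		r = -EINVAL;
		if (ops->vm_enable_cap)
			r = ops->vm_enable_cap(vm, cap);
		break;
	}
	return r;
}

/* Example hook with the same shape as vt_vm_enable_cap(): TD guests only. */
static int sketch_backend_hook(struct sketch_vm *vm, struct sketch_cap *cap)
{
	if (vm->is_td && cap->cap == SKETCH_CAP_BACKEND)
		return 0;
	return -EINVAL;
}
```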

We only allow kvm->max_vcpus to be updated for a TDX guest, in
vt_vm_enable_cap(), to avoid any unnecessary change for normal VMX
guests.  The new kvm->max_vcpus cannot exceed either KVM_MAX_VCPUS or
tdx_info->max_vcpus_per_td:

+       case KVM_CAP_MAX_VCPUS: {
+               if (cap->flags || cap->args[0] == 0)
+                       return -EINVAL;
+               if (cap->args[0] > KVM_MAX_VCPUS ||
+                   cap->args[0] > tdx_info->max_vcpus_per_td)
+                       return -E2BIG;
+
+               mutex_lock(&kvm->lock);
+               if (kvm->created_vcpus)
+                       r = -EBUSY;
+               else {
+                       kvm->max_vcpus = cap->args[0];
+                       r = 0;
+               }
+               mutex_unlock(&kvm->lock);
+               break;
+       }
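
The order of checks in the handler above (argument sanity, then limits, then "no vCPU created yet") can be exercised with a small userspace model. This is a sketch only: struct sketch_vm and SKETCH_KVM_MAX_VCPUS are hypothetical stand-ins, and the kvm->lock locking is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for KVM_MAX_VCPUS. */
#define SKETCH_KVM_MAX_VCPUS 4096

/* Reduced, hypothetical stand-in for struct kvm. */
struct sketch_vm {
	int max_vcpus;
	int created_vcpus;
};

/*
 * Single-threaded model of the KVM_CAP_MAX_VCPUS case.  The real handler
 * takes kvm->lock around the created_vcpus check and the update.
 */
static int sketch_set_max_vcpus(struct sketch_vm *vm, uint32_t flags,
				uint64_t requested, uint16_t module_limit)
{
	assert(vm != NULL);

	/* No flags are defined, and a limit of zero makes no sense. */
	if (flags || requested == 0)
		return -EINVAL;

	/* Must fit both KVM's own limit and the TDX module's limit. */
	if (requested > SKETCH_KVM_MAX_VCPUS || requested > module_limit)
		return -E2BIG;

	/* The limit can only change before any vCPU exists. */
	if (vm->created_vcpus)
		return -EBUSY;

	vm->max_vcpus = (int)requested;
	return 0;
}
```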

3) We just report kvm->max_vcpus when userspace queries
KVM_CAP_MAX_VCPUS, as shown at the beginning of my reply.

Does this make sense to you?
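
From the userspace side, steps 2) and 3) would look roughly like the sketch below. It uses the existing KVM_ENABLE_CAP and KVM_CHECK_EXTENSION ioctls from <linux/kvm.h>; the sketch_* wrapper names are hypothetical, and whether KVM_ENABLE_CAP accepts KVM_CAP_MAX_VCPUS for a given VM depends on this patch being applied (with it, non-TDX VMs would still get -EINVAL).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* args[] entries are 64-bit, so a uint64_t limit is passed through unchanged. */
static_assert(sizeof(((struct kvm_enable_cap *)0)->args[0]) == sizeof(uint64_t),
	      "kvm_enable_cap.args[] entries are expected to be 64-bit");

/*
 * Ask KVM to cap this VM at max_vcpus.  Must be called before any vCPU is
 * created.  Returns 0 on success, -1 with errno set otherwise.
 */
static int sketch_set_vm_max_vcpus(int vm_fd, uint64_t max_vcpus)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));	/* flags must be zero */
	cap.cap = KVM_CAP_MAX_VCPUS;
	cap.args[0] = max_vcpus;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/*
 * Query the limit.  On the /dev/kvm fd this is the system-wide value; on a
 * VM fd it would reflect the per-VM value once the patch is applied.
 * Returns the limit, or -1 with errno set on failure.
 */
static int sketch_get_max_vcpus(int fd)
{
	return ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);
}
```

A VMM would call sketch_set_vm_max_vcpus() right after KVM_CREATE_VM and then sketch_get_max_vcpus() on the same VM fd to confirm the value that will end up in the TD's measurement.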

I am also pasting the updated patch for your review (there are line
wrapping issues unfortunately due to the simple copy/paste):

From 797e439634d106f744517c97c5ea7887e494fc44 Mon Sep 17 00:00:00 2001
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
Date: Thu, 16 Feb 2023 17:03:40 -0800
Subject: [PATCH] KVM: TDX: Allow userspace to configure maximum vCPUs for TDX guests

TDX has its own mechanism to control the maximum number of vCPUs that
the TDX guest can use.  When creating a TDX guest, the maximum number of
vCPUs of the guest needs to be passed to the TDX module as part of the
measurement of the guest.  Depending on TDX module's version, it may
also report the maximum vCPUs it can support for all TDX guests.

Because the maximum number of vCPUs is part of the measurement, and thus
part of attestation, it's better to allow userspace to configure it.
E.g. the users may want to precisely control the maximum number of
vCPUs their precious VMs can use.

The actual control itself must be done via the TDH.MNG.INIT SEAMCALL,
where the maximum number of vCPUs is part of the input to the TDX module,
but KVM needs to support the "per-VM maximum number of vCPUs" and
reflect that in the KVM_CAP_MAX_VCPUS.

Currently, KVM x86 always reports KVM_MAX_VCPUS for all VMs but
doesn't allow enabling KVM_CAP_MAX_VCPUS to configure the maximum
number of vCPUs on a per-VM basis.

Add "per-VM maximum number of vCPUs" to KVM x86/TDX to accommodate TDX's
needs.

Specifically, use KVM's existing KVM_ENABLE_CAP IOCTL() to allow the
userspace to configure the maximum vCPUs by making KVM x86 support
enabling the KVM_CAP_MAX_VCPUS cap on a per-VM basis.

For that, add a new 'kvm_x86_ops::vm_enable_cap()' callback and call
it from kvm_vm_ioctl_enable_cap() as a placeholder to handle the
KVM_CAP_MAX_VCPUS for TDX guests (and other KVM_CAP_xx for TDX and/or
other VMs if needed in the future).

Implement the callback for TDX guest to check whether the maximum vCPUs
passed from userspace can be supported by TDX, and if it can, override
'struct kvm::max_vcpus'.

Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the
'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS.

Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
---
v20:
- Drop max_vcpu ops to use kvm.max_vcpus
- Remove TDX_MAX_VCPUS (Kai)
- Use type cast (u16) instead of calling memcpy() when reading the
   'max_vcpus_per_td' (Kai)
- Improve change log and change patch title from "KVM: TDX: Make
  KVM_CAP_MAX_VCPUS backend specific" (Kai)
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/vmx/main.c            | 10 ++++++++
 arch/x86/kvm/vmx/tdx.c             | 40 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h         |  5 ++++
 arch/x86/kvm/x86.c                 |  4 +++
 6 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index bcb8302561f2..022b9eace3a5 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
 KVM_X86_OP(hardware_unsetup)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP_OPTIONAL(vm_enable_cap)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c461c2e57fcb..1d10e3d29533 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1639,6 +1639,7 @@ struct kvm_x86_ops {
        void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

        unsigned int vm_size;
+       int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
        int (*vm_init)(struct kvm *kvm);
        void (*vm_destroy)(struct kvm *kvm);

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8e4aa8d15aec..686ca6348993 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,7 @@
 #include "nested.h"
 #include "pmu.h"
 #include "tdx.h"
+#include "tdx_arch.h"

 static bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
@@ -33,6 +34,14 @@ static void vt_hardware_unsetup(void)
        vmx_hardware_unsetup();
 }

+static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+       if (is_td(kvm))
+               return tdx_vm_enable_cap(kvm, cap);
+
+       return -EINVAL;
+}
+

 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
        if (!is_td(kvm))
@@ -63,6 +72,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
        .has_emulated_msr = vmx_has_emulated_msr,

        .vm_size = sizeof(struct kvm_vmx),
+       .vm_enable_cap = vt_vm_enable_cap,
        .vm_init = vmx_vm_init,
        .vm_destroy = vmx_vm_destroy,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c7d849582d44..cdfc95904d6c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -34,6 +34,8 @@ struct tdx_info {
        u64 xfam_fixed0;
        u64 xfam_fixed1;

+       u16 max_vcpus_per_td;
+
        u16 num_cpuid_config;
        /* This must the last member. */
        DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
@@ -42,6 +44,35 @@ struct tdx_info {
 /* Info about the TDX module. */
 static struct tdx_info *tdx_info;

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+       int r;
+
+       switch (cap->cap) {
+       case KVM_CAP_MAX_VCPUS: {
+               if (cap->flags || cap->args[0] == 0)
+                       return -EINVAL;
+               if (cap->args[0] > KVM_MAX_VCPUS ||
+                   cap->args[0] > tdx_info->max_vcpus_per_td)
+                       return -E2BIG;
+
+               mutex_lock(&kvm->lock);
+               if (kvm->created_vcpus)
+                       r = -EBUSY;
+               else {
+                       kvm->max_vcpus = cap->args[0];
+                       r = 0;
+               }
+               mutex_unlock(&kvm->lock);
+               break;
+       }
+       default:
+               r = -EINVAL;
+               break;
+       }
+       return r;
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
        struct kvm_tdx_capabilities __user *user_caps;
@@ -129,6 +160,7 @@ static int __init tdx_module_setup(void)
                u16 num_cpuid_config;
                /* More member will come. */
        } st;
+       u64 tmp;
        int ret;
        u32 i;

@@ -167,6 +199,14 @@ static int __init tdx_module_setup(void)
                return -ENOMEM;
        tdx_info->num_cpuid_config = st.num_cpuid_config;

+       /*
+        * TDX module may not support MD_FIELD_ID_MAX_VCPUS_PER_TD depending
+        * on its version.
+        */
+       tdx_info->max_vcpus_per_td = U16_MAX;
+       if (!tdx_sys_metadata_field_read(MD_FIELD_ID_MAX_VCPUS_PER_TD, &tmp))
+               tdx_info->max_vcpus_per_td = (u16)tmp;
+
        ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
        if (ret)
                goto error_out;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 9bc287a7efac..7c768e360bc6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -139,11 +139,16 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
 static inline void tdx_hardware_unsetup(void) {}

+static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+       return -EINVAL;
+}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee8288a46d30..97ed4fe25964 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4776,6 +4776,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
                break;
        case KVM_CAP_MAX_VCPUS:
                r = KVM_MAX_VCPUS;
+               if (kvm)
+                       r = kvm->max_vcpus;
                break;
        case KVM_CAP_MAX_VCPU_ID:
                r = KVM_MAX_VCPU_IDS;
@@ -6827,6 +6829,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
        }
        default:
                r = -EINVAL;
+               if (kvm_x86_ops.vm_enable_cap)
+                       r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
                break;
        }
        return r;
--
2.34.1