On Thu, Mar 28, 2024 at 12:33:35PM +1300, "Huang, Kai" <kai.huang@xxxxxxxxx> wrote:
> > +	kvm_tdx->tdr_pa = tdr_pa;
> > +
> > +	for_each_online_cpu(i) {
> > +		int pkg = topology_physical_package_id(i);
> > +
> > +		if (cpumask_test_and_set_cpu(pkg, packages))
> > +			continue;
> > +
> > +		/*
> > +		 * Program the memory controller in the package with an
> > +		 * encryption key associated to a TDX private host key id
> > +		 * assigned to this TDR.  Concurrent operations on same memory
> > +		 * controller results in TDX_OPERAND_BUSY.  Avoid this race by
> > +		 * mutex.
> > +		 */
>
> IIUC the race can only happen when you are creating multiple TDX guests
> simulatenously?  Please clarify this in the comment.
>
> And I even don't think you need all these TDX module details:
>
> 		/*
> 		 * Concurrent run of TDH.MNG.KEY.CONFIG on the same
> 		 * package resluts in TDX_OPERAND_BUSY.  When creating
> 		 * multiple TDX guests simultaneously this can run
> 		 * concurrently.  Take the per-package lock to
> 		 * serialize.
> 		 */

As pointed out by Chao, those mutexes will be dropped.
https://lore.kernel.org/kvm/ZfpwIespKy8qxWWE@chao-email/
We will also simplify the cpumask handling that tracks which packages are
online/offline and which CPU to use for each package.

> > +		mutex_lock(&tdx_mng_key_config_lock[pkg]);
> > +		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> > +				      &kvm_tdx->tdr_pa, true);
> > +		mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> > +		if (ret)
> > +			break;
> > +	}
> > +	cpus_read_unlock();
> > +	free_cpumask_var(packages);
> > +	if (ret) {
> > +		i = 0;
> > +		goto teardown;
> > +	}
> > +
> > +	kvm_tdx->tdcs_pa = tdcs_pa;
> > +	for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > +		err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> > +		if (err == TDX_RND_NO_ENTROPY) {
> > +			/* Here it's hard to allow userspace to retry. */
> > +			ret = -EBUSY;
> > +			goto teardown;
> > +		}
> > +		if (WARN_ON_ONCE(err)) {
> > +			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> > +			ret = -EIO;
> > +			goto teardown;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
> > +	 * ioctl() to define the configure CPUID values for the TD.
> > +	 */
>
> Then, how about renaming this function to __tdx_td_create()?

So do we also want to rename the ioctl for consistency, i.e.
KVM_TDX_INIT_VM => KVM_TDX_CREATE_VM?  I don't have a strong opinion on
those names.  Maybe KVM_TDX_{INIT, CREATE, or CONFIG}_VM?  And we can
rename the function to match it.

> > +	return 0;
> > +
> > +	/*
> > +	 * The sequence for freeing resources from a partially initialized TD
> > +	 * varies based on where in the initialization flow failure occurred.
> > +	 * Simply use the full teardown and destroy, which naturally play nice
> > +	 * with partial initialization.
> > +	 */
> > +teardown:
> > +	for (; i < tdx_info->nr_tdcs_pages; i++) {
> > +		if (tdcs_pa[i]) {
> > +			free_page((unsigned long)__va(tdcs_pa[i]));
> > +			tdcs_pa[i] = 0;
> > +		}
> > +	}
> > +	if (!kvm_tdx->tdcs_pa)
> > +		kfree(tdcs_pa);
>
> The code to "free TDCS pages in a loop and free the array" is done below
> with duplicated code.  I am wondering whether we have way to eliminate one.
>
> But I have lost track here, so perhaps we can review again after we split
> the patch to smaller pieces.

Surely we can simplify it.  Originally we had a spin lock, so I had to
separate the blocking memory allocation from its usage with this error
cleanup path.  Now that it's a mutex, we can mix the page allocation with
its usage.
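One way to eliminate the duplication could be a small helper that frees the
TDCS pages and the array, called from both error paths.  A rough, untested
sketch (the helper name is made up; everything else is taken from this
patch):

static void tdx_free_tdcs_pages(struct kvm_tdx *kvm_tdx, unsigned long *tdcs_pa)
{
	int i;

	if (!tdcs_pa)
		return;

	/* Free each TDCS page, then the array of physical addresses itself. */
	for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
		if (tdcs_pa[i])
			free_page((unsigned long)__va(tdcs_pa[i]));
	}
	kfree(tdcs_pa);
	kvm_tdx->tdcs_pa = NULL;
}

The teardown: path isn't a drop-in user, though, since it keeps the array
when kvm_tdx->tdcs_pa is already set, so this is easier to sort out once
the patch is split into smaller pieces.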
> > +	tdx_mmu_release_hkid(kvm);
> > +	tdx_vm_free(kvm);
> > +	return ret;
> > +
> > +free_packages:
> > +	cpus_read_unlock();
> > +	free_cpumask_var(packages);
> > +free_tdcs:
> > +	for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > +		if (tdcs_pa[i])
> > +			free_page((unsigned long)__va(tdcs_pa[i]));
> > +	}
> > +	kfree(tdcs_pa);
> > +	kvm_tdx->tdcs_pa = NULL;
> > +
> > +free_tdr:
> > +	if (tdr_pa)
> > +		free_page((unsigned long)__va(tdr_pa));
> > +	kvm_tdx->tdr_pa = 0;
> > +free_hkid:
> > +	if (is_hkid_assigned(kvm_tdx))
> > +		tdx_hkid_free(kvm_tdx);
> > +	return ret;
> > +}
> > +
> >  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> >  {
> >  	struct kvm_tdx_cmd tdx_cmd;
> > @@ -215,12 +664,13 @@ static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> >  static int __init tdx_module_setup(void)
> >  {
> > -	u16 num_cpuid_config;
> > +	u16 num_cpuid_config, tdcs_base_size;
> >  	int ret;
> >  	u32 i;
> >  	struct tdx_md_map mds[] = {
> >  		TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> > +		TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
> >  	};
> >  	struct tdx_metadata_field_mapping fields[] = {
> > @@ -273,6 +723,8 @@ static int __init tdx_module_setup(void)
> >  		c->edx = ecx_edx >> 32;
> >  	}
> > +	tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
> > +
>
> Round up the 'tdcs_base_size' to make sure you have enough room, or put a
> WARN() here if not page aligned?

Ok, will add the round up.  Same for tdvps_base_size.
I can't find anything in the TDX spec about the relationship between those
sizes and the page size.  Although TDH.MNG.ADDCX() and TDH.VP.ADDCX() imply
that the sizes are multiples of PAGE_SIZE, the spec doesn't guarantee it.
I think a silent round up is better than WARN() because we can do nothing
about the values the TDX module provides.

> >  	return 0;
> >  error_out:
> > @@ -319,13 +771,27 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> >  	struct tdx_enabled enable = {
> >  		.err = ATOMIC_INIT(0),
> >  	};
> > +	int max_pkgs;
> >  	int r = 0;
> > +	int i;
>
> Nit: you can put the 3 into one line.
>
> > +	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > +		pr_warn("MOVDIR64B is reqiured for TDX\n");
>
> It's better to make it more clear:
>
> "Disable TDX: MOVDIR64B is not supported or disabled by the kernel."
>
> Or, to match below:
>
> "Cannot enable TDX w/o MOVDIR64B".

Ok.

> > +		return -EOPNOTSUPP;
> > +	}
> >  	if (!enable_ept) {
> >  		pr_warn("Cannot enable TDX with EPT disabled\n");
> >  		return -EINVAL;
> >  	}
> > +	max_pkgs = topology_max_packages();
> > +	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> > +					  GFP_KERNEL);
> > +	if (!tdx_mng_key_config_lock)
> > +		return -ENOMEM;
> > +	for (i = 0; i < max_pkgs; i++)
> > +		mutex_init(&tdx_mng_key_config_lock[i]);
> > +
>
> Using a per-socket lock looks a little bit overkill to me.  I don't know
> whether we need to do in the initial version.  Will leave to others.
>
> Please at least add a comment to explain this is for better performance when
> creating multiple TDX guests IIUC?

Will delete the mutex and simplify the related logic.

-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>