Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 26 Mar 2024 14:43:54 +1300

... continue the previous review ...

+
+static void tdx_reclaim_control_page(unsigned long td_page_pa)
+{
+	WARN_ON_ONCE(!td_page_pa);

From the name 'td_page_pa' we cannot tell whether it is a control page, 
but this function is only intended for control page AFAICT, so perhaps a 
more specific name.

+
+	/*
+	 * TDCX are being reclaimed.  TDX module maps TDCX with HKID

"are" -> "is".

Are you sure it is TDCX, but not TDCS?

AFAICT TDCX is the control structure for 'vcpu', but here you are 
handling the control structure for the VM.

+	 * assigned to the TD.  Here the cache associated to the TD
+	 * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
+	 * cache doesn't need to be flushed again.
+	 */

How about put this part as the comment for this function?

/*
 * Reclaim <name of control page> page(s) which are crypto-protected
 * by TDX guest's private KeyID.  Assume the cache associated with the
 * TDX private KeyID has been flushed.
 */
+	if (tdx_reclaim_page(td_page_pa))
+		/*
+		 * Leak the page on failure:
+		 * tdx_reclaim_page() returns an error if and only if there's an
+		 * unexpected, fatal error, e.g. a SEAMCALL with bad params,
+		 * incorrect concurrency in KVM, a TDX Module bug, etc.
+		 * Retrying at a later point is highly unlikely to be
+		 * successful.
+		 * No log here as tdx_reclaim_page() already did.

IMHO can be simplified to below, and nothing else matters.

	/*
	 * Leak the page if the kernel failed to reclaim the page.
	 * The krenel cannot use it safely anymore.
	 */

And you can put this comment above the 'if (tdx_reclaim_page())' statement.

+		 */
+		return;

Empty line.

+	free_page((unsigned long)__va(td_page_pa));
+}
+
+static void tdx_do_tdh_phymem_cache_wb(void *unused)

Better to make the name explicit that it is a smp_func, and you don't 
need the "tdx_" prefix for all the 'static' functions here:

	static void smp_func_do_phymem_cache_wb(void *unused)

+{
+	u64 err = 0;
+
+	do {
+		err = tdh_phymem_cache_wb(!!err);

		bool resume = !!err;

		err = tdh_phymem_cache_wb(resume);

So that we don't need to jump to the tdh_phymem_cache_wb() to see what 
does !!err mean.

+	} while (err == TDX_INTERRUPTED_RESUMABLE);

Add a comment before the do {} while():

	/*
	 * TDH.PHYMEM.CACHE.WB flushes caches associated with _ANY_
	 * TDX private KeyID on the package (or logical cpu?) where
	 * it is called on.  The TDX module may not finish the cache
	 * flush but return TDX_INTERRUPTED_RESUMEABLE instead.  The
	 * kernel should retry it until it returns success w/o
	 * rescheduling.
	 */
+
+	/* Other thread may have done for us. */
+	if (err == TDX_NO_HKID_READY_TO_WBCACHE)
+		err = TDX_SUCCESS;

Empty line.

+	if (WARN_ON_ONCE(err))
+		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+	bool packages_allocated, targets_allocated;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages, targets;
+	u64 err;
+	int i;
+
+	if (!is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (!is_td_created(kvm_tdx)) {
+		tdx_hkid_free(kvm_tdx);
+		return;
+	}

I lost tracking what does "td_created()" mean.

I guess it means: KeyID has been allocated to the TDX guest, but not yet 
programmed/configured.

Perhaps add a comment to remind the reviewer?

+
+	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
+	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
+	cpus_read_lock();
+
+	/*
+	 * We can destroy multiple guest TDs simultaneously.  Prevent
+	 * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
+	 */

IMHO it's better to remind people that TDH.PHYMEM.CACHE.WB tries to grab 
the global TDX module lock:

	/*
	 * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global
	 * lock and can fail with TDX_OPERAND_BUSY when it fails to
	 * grab.  Multiple TDX guests can be destroyed simultaneously.
	 * Take the mutex to prevent it from getting error.
	 */
+	mutex_lock(&tdx_lock);
+
+	/*
+	 * Go through multiple TDX HKID state transitions with three SEAMCALLs
+	 * to make TDH.PHYMEM.PAGE.RECLAIM() usable. 

What is "TDX HKID state transitions"?  Not mentioned before, so needs 
explanation _if_ you want to say this.

And what are the three "SEAMCALLs"?  Where are they?  The only _two_ 
SEAMCALLs that I can see here are: TDH.PHYMEM.CACHE.WB and 
TDH.MNG.KEY.FREEID.

 Make the transition atomic
+	 * to other functions to operate private pages and Secure-EPT pages.

What's the consequence to "other functions" if we don't make it atomic here?

+	 *
+	 * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
+	 * This function is called via mmu notifier, mmu_release().
+	 * kvm_gmem_release() is called via fput() on process exit.
+	 */
+	write_lock(&kvm->mmu_lock);

I don't fully get the race here, but it seems strange that this function 
is called via mmu notifier.

IIUC, this function is "supposedly" only be called when we tear down the 
VM, so I don't know why there's such race.

+
+	for_each_online_cpu(i) {
+		if (packages_allocated &&
+		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
+					     packages))
+			continue;
+		if (targets_allocated)
+			cpumask_set_cpu(i, targets);
+	}
+	if (targets_allocated)
+		on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
+	else
+		on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);

I don't understand the logic here -- no comments whatever.

But I am 99% sure the logic here could be simplified.

+	/*
+	 * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
+	 * tdh_mng_key_freeid() will fail.
+	 */
+	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
+	if (WARN_ON_ONCE(err)) {

I see KVM_BUG_ON() is normally used for SEAMCALL error.  Why this uses 
WARN_ON_ONCE() here?

+		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
+		pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
+		       kvm_tdx->hkid);
+	} else
+		tdx_hkid_free(kvm_tdx);
+
+	write_unlock(&kvm->mmu_lock);
+	mutex_unlock(&tdx_lock);
+	cpus_read_unlock();
+	free_cpumask_var(targets);
+	free_cpumask_var(packages);
+}
+
+void tdx_vm_free(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+	int i;
+
+	/*
+	 * tdx_mmu_release_hkid() failed to reclaim HKID.  Something went wrong
+	 * heavily with TDX module.  Give up freeing TD pages.  As the function
+	 * already warned, don't warn it again.
+	 */
+	if (is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (kvm_tdx->tdcs_pa) {
+		for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
+			if (kvm_tdx->tdcs_pa[i])
+				tdx_reclaim_control_page(kvm_tdx->tdcs_pa[i]);

AFAICT, here tdcs_pa[i] cannot be NULL, right?  How about:

			if (!WARN_ON_ONCE(!kvm_tdx->tdcs_pa[i]))
				continue;

			tdx_reclaim_control_page(...);

which at least saves you some indent.

Btw, does it make sense to stop if any tdx_reclaim_control_page() fails?

It's OK to continue, but perhaps worth to add a comment to point out:

			/*
			 * Continue to reclaim other control pages and
			 * TDR page, even failed to reclaim one control
			 * page.  Do the best to reclaim these TDX
			 * private pages.
			 */
			tdx_reclaim_control_page();
+		}
+		kfree(kvm_tdx->tdcs_pa);
+		kvm_tdx->tdcs_pa = NULL;
+	}
+
+	if (!kvm_tdx->tdr_pa)
+		return;
+	if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
+		return;
+	/*
+	 * TDX module maps TDR with TDX global HKID.  TDX module may access TDR
+	 * while operating on TD (Especially reclaiming TDCS).  Cache flush with > +	 * TDX global HKID is needed.
+	 */

"Especially reclaiming TDCS" -> "especially when it is reclaiming TDCS".

Use imperative mode to describe your change:

Use the SEAMCALL to ask the TDX module to flush the cache of it using 
the global KeyID.

+	err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
+						     tdx_global_keyid));
+	if (WARN_ON_ONCE(err)) {

Again, KVM_BUG_ON()?

Should't matter, though.

+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+		return;
+	}
+	tdx_clear_page(kvm_tdx->tdr_pa);
+
+	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
+	kvm_tdx->tdr_pa = 0;
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+	hpa_t *tdr_p = param;
+	u64 err;
+
+	do {
+		err = tdh_mng_key_config(*tdr_p);
+
+		/*
+		 * If it failed to generate a random key, retry it because this
+		 * is typically caused by an entropy error of the CPU's random
+		 * number generator.
+		 */
+	} while (err == TDX_KEY_GENERATION_FAILED);

If you want to handle TDX_KEY_GENERTION_FAILED, it's better to have a 
retry limit similar to the TDX host code does.

+
+	if (WARN_ON_ONCE(err)) {

KVM_BUG_ON()?

+		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm);
+
+int tdx_vm_init(struct kvm *kvm)
+{
+	/*
+	 * TDX has its own limit of the number of vcpus in addition to
+	 * KVM_MAX_VCPUS.
+	 */
+	kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);

I believe this should be part of the patch that handles KVM_CAP_MAX_VCPUS.

+
+	/* Place holder for TDX specific logic. */
+	return __tdx_td_init(kvm);
+}
+

... to be continued ...