On Wed, 29 Mar 2023 18:01:53 -0700
Isaku Yamahata <isaku.yamahata@xxxxxxxxx> wrote:

> Hi, thanks for the review.
> 
> On Sun, Mar 26, 2023 at 02:09:36PM +0300,
> Zhi Wang <zhi.wang.linux@xxxxxxxxx> wrote:
> 
> > On Sun, 12 Mar 2023 10:55:43 -0700
> > isaku.yamahata@xxxxxxxxx wrote:
> > 
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 8b02e605cfb5..3ede8a726b47 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -5,8 +5,9 @@
> > >  
> > >  #include "capabilities.h"
> > >  #include "x86_ops.h"
> > > -#include "x86.h"
> > >  #include "tdx.h"
> > > +#include "tdx_ops.h"
> > > +#include "x86.h"
> > >  
> > >  #undef pr_fmt
> > >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > @@ -46,11 +47,276 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> > >  	return r;
> > >  }
> > >  
> > > +struct tdx_info {
> > > +	u8 nr_tdcs_pages;
> > > +};
> > > +
> > > +/* Info about the TDX module. */
> > > +static struct tdx_info tdx_info;
> > > +
> > > +/*
> > > + * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
> > > + * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) try to acquire a global lock
> > > + * internally in the TDX module. On contention, TDX_OPERAND_BUSY is returned
> > > + * without spinning or waiting due to a constraint on execution time. It's the
> > > + * caller's responsibility to avoid the race (or retry on TDX_OPERAND_BUSY).
> > > + * Use this mutex to avoid the race in the TDX module because the kernel
> > > + * knows better about scheduling.
> > > + */
> > > +static DEFINE_MUTEX(tdx_lock);
> > > +static struct mutex *tdx_mng_key_config_lock;
> > > +
> > > +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > > +{
> > > +	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> > > +}
> > > +
> > > +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	return kvm_tdx->tdr_pa;
> > > +}
> > > +
> > > +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	tdx_guest_keyid_free(kvm_tdx->hkid);
> > > +	kvm_tdx->hkid = 0;
> > > +}
> > > +
> > > +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	return kvm_tdx->hkid > 0;
> > > +}
> > > +
> > >  int tdx_hardware_enable(void)
> > >  {
> > >  	return tdx_cpu_enable();
> > >  }
> > >  
> > > +static void tdx_clear_page(unsigned long page_pa)
> > > +{
> > > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > +	void *page = __va(page_pa);
> > > +	unsigned long i;
> > > +
> > > +	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > > +		clear_page(page);
> > > +		return;
> > > +	}
> > 
> > Is it possible to have a TDX machine without MOVDIR64B support? I am not
> > sure if there is any other way for the kernel to clear the poisoned cache
> > line.
> > 
> > If not, there should at least be a WARN/BUG, and MOVDIR64B support should
> > be checked when initializing TDX.
> 
> Because the latest TDX specification uses movdir64b, it's safe for TDX
> to assume movdir64b. I'll add the check to the TDX initialization part
> and drop it from here.
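
Sounds good. Just to illustrate, a minimal sketch of such an init-time
check could look like below (the exact placement in the TDX setup path
and the error message are my assumptions, not your actual patch):

...
	/*
	 * The kernel has no way other than MOVDIR64B to clear a
	 * poisoned cache line, so refuse to bring up TDX without it.
	 */
	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
		pr_warn("Cannot enable TDX: MOVDIR64B is not supported\n");
		return -EOPNOTSUPP;
	}
...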
> 
> > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > +{
> > > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > +	cpumask_var_t packages;
> > > +	bool cpumask_allocated;
> > > +	u64 err;
> > > +	int ret;
> > > +	int i;
> > > +
> > > +	if (!is_hkid_assigned(kvm_tdx))
> > > +		return;
> > > +
> > > +	if (!is_td_created(kvm_tdx))
> > > +		goto free_hkid;
> > > +
> > > +	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > > +	cpus_read_lock();
> > > +	for_each_online_cpu(i) {
> > > +		if (cpumask_allocated &&
> > > +		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > > +					     packages))
> > > +			continue;
> > 
> > Is it necessary to check cpumask_allocated inside the loop? If the cpumask
> > was not successfully allocated, wouldn't it be better to bail out right
> > after the allocation?
> 
> No, because we can't return an error here. It's better to free the
> resources inefficiently than to leak them.
> 
> We could move the check out of the loop, but it would be ugly
> (if () {cpu loop} else {cpu loop}) and this function isn't performance
> critical. I also think it's okay to rely on the compiler to hoist the
> loop invariant. My compiler didn't optimize it in this case, though.

Do you mean tdh_mng_key_freeid() is still required even if we fail to
allocate the cpumask var and to do TDH.PHYMEM.CACHE.WB (WBINVD) on each
CPU?

Out of curiosity, I took a look at the TDX module source code [1]. It
seems the TDX module has an additional check in TDH.MNG.KEY.FREEID.

TDH.MNG.VPFLUSHDONE [2] marks the pending wbinvd in a bitmap:

...
    /**
     * Create the WBINVD_BITMAP per-package.
     * Set to 1 num_of_pkgs bits from the LSB
     */
    global_data_ptr->kot.entries[curr_hkid].wbinvd_bitmap =
        global_data_ptr->pkg_config_bitmap;	/* <---- HERE */

    // Set new TD life cycle state
    tdr_ptr->management_fields.lifecycle_state = TD_BLOCKED;

    // Set the proper new KOT entry state
    global_data_ptr->kot.entries[curr_hkid].state =
        (uint8_t)KOT_STATE_HKID_FLUSHED;
...

And TDH.MNG.KEY.FREEID [3] checks whether the pending WBINVD has been
performed:

...
    /**
     * If TDH_PHYMEM_CACHE_WB was executed on all packages/cores,
     * set the KOT entry, set the KOT entry state to HKID_FREE.
     */
    curr_hkid = tdr_ptr->key_management_fields.hkid;
    tdx_debug_assert(global_data_ptr->kot.entries[curr_hkid].state ==
                     KOT_STATE_HKID_FLUSHED);

    if (global_data_ptr->kot.entries[curr_hkid].wbinvd_bitmap != 0)	/* HERE */
    {
        TDX_ERROR("CACHEWB is not complete for this HKID (=%x)\n", curr_hkid);
        return_val = TDX_WBCACHE_NOT_COMPLETE;
        goto EXIT;
    }
...

I guess the conclusion is: if TDH.PHYMEM.CACHE.WB is not performed
correctly on each required CPU, TDH.MNG.KEY.FREEID will fail as well. A
leak seems to be the only option (none of us likes a leak, but...).

It would be better to:

1) Leave a comment in the code about the finding above to explain why
the leak happens (it is always nice to explain the reason for a leak);
one sentence will be good enough. See the sketch after 3) below.

2) If the cpumask allocation fails, bail out (given the findings above):

...
	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
	if (!cpumask_allocated)
		return;
...

3) As the reason for the leak will be explained in
tdx_mmu_release_hkid(), leaving a pointer in the comment in
tdx_vm_free() that refers to tdx_mmu_release_hkid() would be enough,
like:

...
	/*
	 * Can't reclaim or free TD pages if teardown failed in
	 * tdx_mmu_release_hkid().
	 */
	if (is_hkid_assigned(kvm_tdx))
		return;
...
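
Putting 1) and 2) together, the bail-out in tdx_mmu_release_hkid() could
look roughly like below (just a sketch of the suggestion; the comment
wording and the error message are mine):

...
	/*
	 * TDH.MNG.VPFLUSHDONE records the packages that still need
	 * TDH.PHYMEM.CACHE.WB in a per-HKID bitmap, and
	 * TDH.MNG.KEY.FREEID fails with TDX_WBCACHE_NOT_COMPLETE until
	 * that bitmap is clear. If we cannot even allocate the cpumask
	 * to track the packages, give up and leak the HKID.
	 */
	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
	if (!cpumask_allocated) {
		pr_err("No memory for cpumask. HKID %d is leaked.\n",
		       kvm_tdx->hkid);
		return;
	}
...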
[1] https://downloadmirror.intel.com/738876/tdx-module-v1.0.01.01.zip
[2] tdx-module/src/vmm_dispatcher/api_calls/tdh_mng_vpflushdone.c
[3] tdx-module/src/vmm_dispatcher/api_calls/tdh_mng_key_freeid.c

> > > +
> > > +		/*
> > > +		 * We can destroy multiple guest TDs simultaneously.
> > > +		 * Prevent tdh_phymem_cache_wb from returning TDX_BUSY by
> > > +		 * serialization.
> > > +		 */
> > > +		mutex_lock(&tdx_lock);
> > > +		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
> > > +		mutex_unlock(&tdx_lock);
> > > +		if (ret)
> > > +			break;
> > > +	}
> > > +	cpus_read_unlock();
> > > +	free_cpumask_var(packages);
> > > +
> > > +	mutex_lock(&tdx_lock);
> > > +	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> > > +	mutex_unlock(&tdx_lock);
> > > +	if (WARN_ON_ONCE(err)) {
> > > +		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> > > +		pr_err("tdh_mng_key_freeid failed. HKID %d is leaked.\n",
> > > +		       kvm_tdx->hkid);
> > > +		return;
> > > +	}
> > > +
> > > +free_hkid:
> > > +	tdx_hkid_free(kvm_tdx);
> > > +}
> > > +
> > > +void tdx_vm_free(struct kvm *kvm)
> > > +{
> > > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > +	int i;
> > > +
> > > +	/* Can't reclaim or free TD pages if teardown failed. */
> > > +	if (is_hkid_assigned(kvm_tdx))
> > > +		return;
> > > +
> > 
> > Better to explain why, as it is common to think that even if the teardown
> > failed, we should still try to reclaim as many pages as we can.
> 
> Ok, here is the updated comment.
> 
> 	/*
> 	 * tdx_mmu_release_hkid() failed to reclaim HKID. Something went
> 	 * seriously wrong with the TDX module. Give up freeing TD pages.
> 	 * As the function already warned, don't warn it again.
> 	 */
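
The updated comment looks good to me.

By the way, tdx_do_tdh_phymem_cache_wb() used by the loop above is not
part of the quoted hunk. For anyone following along, from my reading of
the rest of the series it is roughly shaped like below (paraphrased from
memory, so treat the exact status-code handling as an assumption):

	static int tdx_do_tdh_phymem_cache_wb(void *param)
	{
		u64 err = 0;

		/*
		 * TDH.PHYMEM.CACHE.WB is interruptible; resume it until
		 * it completes instead of restarting from scratch.
		 */
		do {
			err = tdh_phymem_cache_wb(!!err);
		} while (err == TDX_INTERRUPTED_RESUMABLE);

		/* Another thread may have done the write-back for us. */
		if (err == TDX_NO_HKID_READY_TO_WBCACHE)
			err = TDX_SUCCESS;

		if (WARN_ON_ONCE(err)) {
			pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
			return -EIO;
		}

		return 0;
	}

The tdx_lock serialization in the caller is what keeps a concurrent TD
teardown from making this SEAMCALL return TDX_BUSY, as the comment in
the loop says.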