On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > This is a lock status report of the TDX module for the current SEAMCALL
> > > retry issue, based on code in the TDX module public repo
> > > https://github.com/intel/tdx-module.git, branch TDX_1.5.05.
> > >
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> >
> > The zero-step logic seems to be the most problematic.  E.g. if KVM is
> > trying to install a page on behalf of two vCPUs, and KVM resumes the
> > guest when it encounters a FROZEN_SPTE while building the non-leaf
> > SPTEs, then one of the vCPUs could trigger the zero-step mitigation if
> > the vCPU that "wins" gets delayed for whatever reason.
> >
> > Since FROZEN_SPTE is essentially a bit-spinlock with a reaaaaaly slow
> > slow-path, what if, instead of resuming the guest when a page fault hits
> > a FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing
> > tdh_vp_enter(), to see if the vCPU still hits the fault?
> >
> > For non-TDX, resuming the guest and letting the vCPU retry the
> > instruction is desirable because in many cases, the winning task will
> > install a valid mapping before KVM can re-run the vCPU, i.e. the fault
> > will be fixed before the instruction is re-executed.  In the happy case,
> > that provides optimal performance, as KVM doesn't introduce any extra
> > delay/latency.
> >
> > But for TDX, the math is different, as the cost of re-hitting a fault is
> > much, much higher, especially in light of the zero-step issues.
> >
> > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > kvm_mmu_page_fault() is modified to return the raw return code instead
> > of '1', then the TDX EPT violation path can safely retry locally,
> > similar to the do-while loop in kvm_tdp_map_page().
> >
> > The only part I don't like about this idea is having two "retry" return
> > values, which creates the potential for bugs due to checking one but not
> > the other.
> >
> > Hmm, that could be avoided by passing a bool pointer as an out-param to
> > communicate to the TDX S-EPT fault handler that the SPTE is frozen.  I
> > think I like that option better, even though the out-param is a bit
> > gross, because it makes it more obvious that the "frozen_spte" case is a
> > special case that doesn't need attention for most paths.
> Good idea.
> But could we extend it a bit more, to allow TDX's EPT violation handler
> to also retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns
> BUSY?

I'm asking this because merely avoiding tdh_vp_enter() on vCPUs that see a
FROZEN_SPTE might not be enough to prevent the zero-step mitigation.

E.g. in the selftest below, with a TD configured with pending_ve_disable=N,
the zero-step mitigation can be triggered on a vCPU that is stuck in EPT
violation VM exits more than 6 times (because user space does not do the
memslot conversion correctly).

So, if vCPU A wins the chance to call tdh_mem_page_aug(), that SEAMCALL may
contend with the zero-step mitigation code in tdh_vp_enter() on vCPU B,
which is stuck in EPT violation VM exits.
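For illustration, below is a rough sketch of the local retry as I understand
it (completely untested; the trailing "frozen_spte" out-param plumbed through
kvm_mmu_do_page_fault() is the hypothetical piece, and the loop mirrors the
do-while in kvm_tdp_map_page()):

static int tdx_handle_ept_violation_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
					  u64 error_code)
{
	bool frozen_spte;
	int r;

	do {
		if (signal_pending(current))
			return -EINTR;
		cond_resched();

		/*
		 * Hypothetical out-param: set to true by the TDP MMU if the
		 * fault was not fixed because the target SPTE is frozen.
		 */
		frozen_spte = false;
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true,
					  NULL, NULL, &frozen_spte);
		/*
		 * Retry locally only when the fault hit a frozen SPTE, i.e.
		 * when another task holds the "lock"; for other RET_PF_RETRY
		 * cases, go back through vcpu_run() and re-enter the guest
		 * as usual.  Re-entering on a frozen SPTE would re-trigger
		 * the EPT violation and feed the zero-step counter.
		 */
	} while (r == RET_PF_RETRY && frozen_spte);

	return r;
}

And here is the selftest mentioned above: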
#include <stdint.h>

#include "kvm_util.h"
#include "processor.h"
#include "tdx/tdcall.h"
#include "tdx/tdx.h"
#include "tdx/tdx_util.h"
#include "tdx/test_util.h"
#include "test_util.h"

/*
 * 0x80000000 is arbitrarily selected, but it should not overlap with selftest
 * code or boot page.
 */
#define ZERO_STEP_TEST_AREA_GPA		(0x80000000)
/* Test area GVA is arbitrarily selected */
#define ZERO_STEP_AREA_GVA_PRIVATE	(0x90000000)

/* The test area is 2MB in size */
#define ZERO_STEP_AREA_SIZE		(2 << 20)

#define ZERO_STEP_ASSERT(x)				\
	do {						\
		if (!(x))				\
			tdx_test_fatal(__LINE__);	\
	} while (0)

#define ZERO_STEP_ACCEPT_PRINT_PORT	0x87

#define ZERO_STEP_THRESHOLD		6
#define TRIGGER_ZERO_STEP_MITIGATION	1

static int convert_request_cnt;

static void guest_test_zero_step(void)
{
	void *test_area_gva_private = (void *)ZERO_STEP_AREA_GVA_PRIVATE;

	memset(test_area_gva_private, 1, 8);
	tdx_test_success();
}

static void guest_ve_handler(struct ex_regs *regs)
{
	uint64_t ret;
	struct ve_info ve;

	ret = tdg_vp_veinfo_get(&ve);
	ZERO_STEP_ASSERT(!ret);

	/* For this test, we will only handle EXIT_REASON_EPT_VIOLATION */
	ZERO_STEP_ASSERT(ve.exit_reason == EXIT_REASON_EPT_VIOLATION);

	tdx_test_send_64bit(ZERO_STEP_ACCEPT_PRINT_PORT, ve.gpa);

#define MEM_PAGE_ACCEPT_LEVEL_4K	0
#define MEM_PAGE_ACCEPT_LEVEL_2M	1
	ret = tdg_mem_page_accept(ve.gpa, MEM_PAGE_ACCEPT_LEVEL_4K);
	ZERO_STEP_ASSERT(!ret);
}

static void zero_step_test(void)
{
	struct kvm_vm *vm;
	struct kvm_vcpu *vcpu;
	void *guest_code;
	uint64_t test_area_npages;
	vm_vaddr_t test_area_gva_private;

	vm = td_create();
	td_initialize(vm, VM_MEM_SRC_ANONYMOUS, 0);
	guest_code = guest_test_zero_step;
	vcpu = td_vcpu_add(vm, 0, guest_code);
	vm_install_exception_handler(vm, VE_VECTOR, guest_ve_handler);

	test_area_npages = ZERO_STEP_AREA_SIZE / vm->page_size;
	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
				    ZERO_STEP_TEST_AREA_GPA, 3,
				    test_area_npages, KVM_MEM_GUEST_MEMFD);
	vm->memslots[MEM_REGION_TEST_DATA] = 3;

	test_area_gva_private = ____vm_vaddr_alloc(
		vm, ZERO_STEP_AREA_SIZE, ZERO_STEP_AREA_GVA_PRIVATE,
		ZERO_STEP_TEST_AREA_GPA, MEM_REGION_TEST_DATA, true);
	TEST_ASSERT_EQ(test_area_gva_private, ZERO_STEP_AREA_GVA_PRIVATE);

	td_finalize(vm);

	handle_memory_conversion(vm, ZERO_STEP_TEST_AREA_GPA,
				 ZERO_STEP_AREA_SIZE, false);
	for (;;) {
		vcpu_run(vcpu);
		if (vcpu->run->exit_reason == KVM_EXIT_IO &&
		    vcpu->run->io.port == ZERO_STEP_ACCEPT_PRINT_PORT) {
			uint64_t gpa = tdx_test_read_64bit(
					vcpu, ZERO_STEP_ACCEPT_PRINT_PORT);

			printf("\t ... guest accepting 1 page at GPA: 0x%lx\n",
			       gpa);
			continue;
		} else if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			bool skip = TRIGGER_ZERO_STEP_MITIGATION &&
				    (convert_request_cnt < ZERO_STEP_THRESHOLD - 1);

			convert_request_cnt++;

			printf("guest requests conversion of gpa 0x%llx, size 0x%llx to %s, skip=%d\n",
			       vcpu->run->memory_fault.gpa,
			       vcpu->run->memory_fault.size,
			       (vcpu->run->memory_fault.flags ==
				KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
			       "private" : "shared", skip);

			if (skip)
				continue;

			handle_memory_conversion(
				vm, vcpu->run->memory_fault.gpa,
				vcpu->run->memory_fault.size,
				vcpu->run->memory_fault.flags ==
				KVM_MEMORY_EXIT_FLAG_PRIVATE);
			continue;
		}
		printf("exit reason %d\n", vcpu->run->exit_reason);
		break;
	}
	kvm_vm_free(vm);
}

int main(int argc, char **argv)
{
	/* Disable stdout buffering */
	setbuf(stdout, NULL);

	if (!is_tdx_enabled()) {
		printf("TDX is not supported by KVM\n"
		       "Skipping the TDX tests.\n");
		return 0;
	}

	run_in_new_process(&zero_step_test);
}
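For completeness, the zero-step detection applied on TD entry, as I read the
TDX module code on branch TDX_1.5.05, conceptually boils down to the below
(paraphrased pseudo-C based on my reading; the struct and function names are
approximations, not the module's actual identifiers):

/*
 * Paraphrase of the TDX module's stepping filter; names are approximate.
 * On each EPT violation TD exit, the module counts consecutive faults at
 * the same guest RIP.
 */
#define STEPPING_EPF_THRESHOLD	6

static void stepping_filter_on_ept_violation(struct tdvps *vp, u64 guest_rip)
{
	if (vp->last_epf_rip == guest_rip)
		vp->epf_cnt++;
	else
		vp->epf_cnt = 1;

	vp->last_epf_rip = guest_rip;

	/*
	 * Too many EPT violations at the same guest RIP: suspect a 0-step
	 * attack.  Subsequent tdh_vp_enter() calls then take the SEPT lock
	 * with host priority, which is what makes concurrent tdh_mem*()
	 * SEAMCALLs (e.g. tdh_mem_page_aug() on vCPU A above) see BUSY.
	 */
	if (vp->epf_cnt >= STEPPING_EPF_THRESHOLD)
		vp->suspect_zero_step = true;
}

i.e. once a vCPU has faulted on the same instruction enough times, every
subsequent tdh_vp_enter() on that vCPU contends with tdh_mem*() on the other
vCPUs, which is the contention the selftest above demonstrates.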