This SEPT SEAMCALL retry proposal aims to remove patch "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" [1] at the tail of v2 series "TDX MMU Part 2". ==brief history== In the v1 series 'TDX MMU Part 2', there were several discussions regarding the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the SEAMCALL wrapper tdx_seamcall_sept(). The lock status of each SEAMCALL relevant to KVM was analyzed in [2]. The conclusion was that 16 retries was necessary because - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. When the TDX module detects that EPT violations are caused by the same RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter() will take SEPT tree lock and contend with tdh_mem*(). - tdg_mem_page_accept() can contend with other tdh_mem*(). Sean provided several good suggestions[3], including: - Implement retries within TDX code when the TDP MMU returns RET_PF_RETRY_FOZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering 0-step mitigation. - It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*() inside TDX module. - Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and prevent tdh_vp_enter() during page uninstallation. Yan later found out that only retry RET_PF_RETRY_FOZEN within TDX code is insufficient to prevent 0-step mitigation [4]. Rick and Yan then consulted TDX module team with findings that: - The threshold of zero-step mitigation is counted per vCPU. It's of value 6 because "There can be at most 2 mapping faults on instruction fetch (x86 macro-instructions length is at most 15 bytes) when the instruction crosses page boundary; then there can be at most 2 mapping faults for each memory operand, when the operand crosses page boundary. For most of x86 macro-instructions, there are up to 2 memory operands and each one of them is small, which brings us to maximum 2+2*2 = 6 legal mapping faults." - Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also contend with SEAMCALLs tdh_mem*(). So, we decided to make a proposal to tolerate 0-step mitigation. ==proposal details== The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs together which are required for page installation and uninstallation. These SEAMCALLs can be divided into three groups: Group 1: tdh_mem_page_add(). The SEAMCALL is invoked only during TD build time and therefore KVM has ensured that no contention will occur. Proposal: (as in patch 1) Just return error when TDX_OPERAND_BUSY is found. Group 2: tdh_mem_sept_add(), tdh_mem_page_aug(). These two SEAMCALLs are invoked for page installation. They return TDX_OPERAND_BUSY when contending with tdh_vp_enter() (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr(). Proposal: (as in patch 1) - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault(). - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as long as there are no pending signals/interrupts. The retry inside TDX aims to reduce the count of tdh_vp_enter() before resolving EPT violations in the local vCPU, thereby minimizing contentions with other vCPUs. However, it can't completely eliminate 0-step mitigation as it exits when there're pending signals/interrupts and does not (and cannot) remove the tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT. Resources SHARED users EXCLUSIVE users ------------------------------------------------------------ SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation) tdh_mem_page_aug ------------------------------------------------------------ SEPT entry tdh_mem_sept_add (Host lock) tdh_mem_page_aug (Host lock) tdg_mem_page_accept (Guest lock) tdg_mem_page_attr_rd (Guest lock) tdg_mem_page_attr_wr (Guest lock) Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(). These three SEAMCALLs are invoked for page uninstallation, with KVM mmu_lock held for writing. Resources SHARED users EXCLUSIVE users ------------------------------------------------------------ TDCS epoch tdh_vp_enter tdh_mem_track ------------------------------------------------------------ SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation) tdh_mem_range_block ------------------------------------------------------------ SEPT entry tdh_mem_range_block (Host lock) tdh_mem_page_remove (Host lock) tdg_mem_page_accept (Guest lock) tdg_mem_page_attr_rd (Guest lock) tdg_mem_page_attr_wr (Guest lock) Proposal: (as in patch 2) - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once. - During the retry, kick off all vCPUs and prevent any vCPU from entering to avoid potential contentions. This is because tdh_vp_enter() and TDCALLs are not protected by KVM mmu_lock, and remove_external_spte() in KVM must not fail. SEAMCALL Lock Type Resource -----------------------------Group 1-------------------------------- tdh_mem_page_add EXCLUSIVE TDR NO_LOCK TDCS NO_LOCK SEPT tree EXCLUSIVE page to add ----------------------------Group 2-------------------------------- tdh_mem_sept_add SHARED TDR SHARED TDCS SHARED SEPT tree HOST,EXCLUSIVE SEPT entry to modify EXCLUSIVE page to add tdh_mem_page_aug SHARED TDR SHARED TDCS SHARED SEPT tree HOST,EXCLUSIVE SEPT entry to modify EXCLUSIVE page to aug ----------------------------Group 3-------------------------------- tdh_mem_range_block SHARED TDR SHARED TDCS EXCLUSIVE SEPT tree HOST,EXCLUSIVE SEPT entry to modify tdh_mem_track SHARED TDR SHARED TDCS EXCLUSIVE TDCS epoch tdh_mem_page_remove SHARED TDR SHARED TDCS SHARED SEPT tree HOST,EXCLUSIVE SEPT entry to modify EXCLUSIVE page to remove [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@xxxxxxxxx [2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@xxxxxxxxxxxxxxxxxxxxxxxxx [3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@xxxxxxxxxx [4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@xxxxxxxxxxxxxxxxxxxxxxxxx Yan Zhao (2): KVM: TDX: Retry in TDX when installing TD private/sept pages KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/mmu/mmu.c | 2 +- arch/x86/kvm/vmx/tdx.c | 102 +++++++++++++++++++++++++------- 3 files changed, 85 insertions(+), 21 deletions(-) -- 2.43.2