On Thu, Nov 21, 2024 at 07:51:39PM +0800, Yan Zhao wrote: > This SEPT SEAMCALL retry proposal aims to remove patch > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" > [1] at the tail of v2 series "TDX MMU Part 2". > > ==brief history== > > In the v1 series 'TDX MMU Part 2', there were several discussions regarding > the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the > SEAMCALL wrapper tdx_seamcall_sept(). > > The lock status of each SEAMCALL relevant to KVM was analyzed in [2]. > > The conclusion was that 16 retries was necessary because > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > When the TDX module detects that EPT violations are caused by the same > RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter() > will take SEPT tree lock and contend with tdh_mem*(). > > - tdg_mem_page_accept() can contend with other tdh_mem*(). > > > Sean provided several good suggestions[3], including: > - Implement retries within TDX code when the TDP MMU returns > RET_PF_RETRY_FOZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering > 0-step mitigation. > - It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*() > inside TDX module. > - Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and > prevent tdh_vp_enter() during page uninstallation. > > Yan later found out that only retry RET_PF_RETRY_FOZEN within TDX code is > insufficient to prevent 0-step mitigation [4]. > > Rick and Yan then consulted TDX module team with findings that: > - The threshold of zero-step mitigation is counted per vCPU. > It's of value 6 because > > "There can be at most 2 mapping faults on instruction fetch > (x86 macro-instructions length is at most 15 bytes) when the > instruction crosses page boundary; then there can be at most 2 > mapping faults for each memory operand, when the operand crosses > page boundary. For most of x86 macro-instructions, there are up to 2 > memory operands and each one of them is small, which brings us to > maximum 2+2*2 = 6 legal mapping faults." > > - Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also > contend with SEAMCALLs tdh_mem*(). > > So, we decided to make a proposal to tolerate 0-step mitigation. > > ==proposal details== > > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs > together which are required for page installation and uninstallation. > > These SEAMCALLs can be divided into three groups: > Group 1: tdh_mem_page_add(). > The SEAMCALL is invoked only during TD build time and therefore > KVM has ensured that no contention will occur. > > Proposal: (as in patch 1) > Just return error when TDX_OPERAND_BUSY is found. > > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug(). > These two SEAMCALLs are invoked for page installation. > They return TDX_OPERAND_BUSY when contending with tdh_vp_enter() > (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(), > tdg_mem_page_attr_rd/wr(). > > Proposal: (as in patch 1) > - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY > to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault(). > > - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as > long as there are no pending signals/interrupts. Alternatively, we can have the tdx_handle_ept_violation() do not retry internally TDX. Instead, keep the 16 times retries in tdx_seamcall_sept() for tdh_mem_sept_add() and tdh_mem_page_aug() only, i.e. only for SEAMCALLs in Group 2. > > The retry inside TDX aims to reduce the count of tdh_vp_enter() > before resolving EPT violations in the local vCPU, thereby > minimizing contentions with other vCPUs. However, it can't > completely eliminate 0-step mitigation as it exits when there're > pending signals/interrupts and does not (and cannot) remove the > tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT. > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------ > SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation) > tdh_mem_page_aug > ------------------------------------------------------------ > SEPT entry tdh_mem_sept_add (Host lock) > tdh_mem_page_aug (Host lock) > tdg_mem_page_accept (Guest lock) > tdg_mem_page_attr_rd (Guest lock) > tdg_mem_page_attr_wr (Guest lock) > > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(). > These three SEAMCALLs are invoked for page uninstallation, with > KVM mmu_lock held for writing. > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------ > TDCS epoch tdh_vp_enter tdh_mem_track > ------------------------------------------------------------ > SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation) > tdh_mem_range_block > ------------------------------------------------------------ > SEPT entry tdh_mem_range_block (Host lock) > tdh_mem_page_remove (Host lock) > tdg_mem_page_accept (Guest lock) > tdg_mem_page_attr_rd (Guest lock) > tdg_mem_page_attr_wr (Guest lock) > > Proposal: (as in patch 2) > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only > once. > - During the retry, kick off all vCPUs and prevent any vCPU from > entering to avoid potential contentions. > > This is because tdh_vp_enter() and TDCALLs are not protected by > KVM mmu_lock, and remove_external_spte() in KVM must not fail. > > > > SEAMCALL Lock Type Resource > -----------------------------Group 1-------------------------------- > tdh_mem_page_add EXCLUSIVE TDR > NO_LOCK TDCS > NO_LOCK SEPT tree > EXCLUSIVE page to add > > ----------------------------Group 2-------------------------------- > tdh_mem_sept_add SHARED TDR > SHARED TDCS > SHARED SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > EXCLUSIVE page to add > > > tdh_mem_page_aug SHARED TDR > SHARED TDCS > SHARED SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > EXCLUSIVE page to aug > > ----------------------------Group 3-------------------------------- > tdh_mem_range_block SHARED TDR > SHARED TDCS > EXCLUSIVE SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > > tdh_mem_track SHARED TDR > SHARED TDCS > EXCLUSIVE TDCS epoch > > tdh_mem_page_remove SHARED TDR > SHARED TDCS > SHARED SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > EXCLUSIVE page to remove > > > [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@xxxxxxxxx > [2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@xxxxxxxxxxxxxxxxxxxxxxxxx > [3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@xxxxxxxxxx > [4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@xxxxxxxxxxxxxxxxxxxxxxxxx > > Yan Zhao (2): > KVM: TDX: Retry in TDX when installing TD private/sept pages > KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal > > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/vmx/tdx.c | 102 +++++++++++++++++++++++++------- > 3 files changed, 85 insertions(+), 21 deletions(-) > > -- > 2.43.2 >