This is a lock status report of TDX module for current SEAMCALL retry issue based on code in TDX module public repo https://github.com/intel/tdx-module.git branch TDX_1.5.05. TL;DR: - tdh_mem_track() can contend with tdh_vp_enter(). - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. - tdg_mem_page_accept() can contend with other tdh_mem*(). Proposal: - Return -EAGAIN directly in ops link_external_spt/set_external_spte when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY. - Kick off vCPUs at the beginning of page removal path, i.e. before the tdh_mem_range_block(). Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done. (one possible optimization: since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare, do not kick off vCPUs in normal conditions. When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow TD enter until page removal completes.) Below are detailed analysis: === Background === In TDX module, there are 4 kinds of locks: 1. sharex_lock: Normal read/write lock. (no host priority stuff) 2. sharex_hp_lock: Just like normal read/write lock, except that host can set host priority bit on failure. when guest tries to acquire the lock and sees host priority bit set, it will return "busy host priority" directly, letting host win. After host acquires the lock successfully, host priority bit is cleared. 3. sept entry lock: Lock utilizing software bits in SEPT entry. HP bit (Host priority): bit 52 EL bit (Entry lock): bit 11, used as a bit lock. - host sets HP bit when host fails to acquire EL bit lock; - host resets HP bit when host wins. - guest returns "busy host priority" if HP bit is found set when guest tries to acquire EL bit lock. 4. mutex lock: Lock with only 2 states: free, lock. (not the same as linux mutex, not re-scheduled, could pause() for debugging). ===Resources & users list=== Resources SHARED users EXCLUSIVE users ------------------------------------------------------------------------ (1) TDR tdh_mng_rdwr tdh_mng_create tdh_vp_create tdh_mng_add_cx tdh_vp_addcx tdh_mng_init tdh_vp_init tdh_mng_vpflushdone tdh_vp_enter tdh_mng_key_config tdh_vp_flush tdh_mng_key_freeid tdh_vp_rd_wr tdh_mr_extend tdh_mem_sept_add tdh_mr_finalize tdh_mem_sept_remove tdh_vp_init_apicid tdh_mem_page_aug tdh_mem_page_add tdh_mem_page_remove tdh_mem_range_block tdh_mem_track tdh_mem_range_unblock tdh_phymem_page_reclaim ------------------------------------------------------------------------ (2) KOT tdh_phymem_cache_wb tdh_mng_create tdh_mng_vpflushdone tdh_mng_key_freeid ------------------------------------------------------------------------ (3) TDCS tdh_mng_rdwr tdh_vp_create tdh_vp_addcx tdh_vp_init tdh_vp_init_apicid tdh_vp_enter tdh_vp_rd_wr tdh_mem_sept_add tdh_mem_sept_remove tdh_mem_page_aug tdh_mem_page_remove tdh_mem_range_block tdh_mem_track tdh_mem_range_unblock ------------------------------------------------------------------------ (4) TDVPR tdh_vp_rd_wr tdh_vp_create tdh_vp_addcx tdh_vp_init tdh_vp_init_apicid tdh_vp_enter tdh_vp_flush ------------------------------------------------------------------------ (5) TDCS epoch tdh_vp_enter tdh_mem_track ------------------------------------------------------------------------ (6) secure_ept_lock tdh_mem_sept_add tdh_vp_enter tdh_mem_page_aug tdh_mem_sept_remove tdh_mem_page_remove tdh_mem_range_block tdh_mem_range_unblock ------------------------------------------------------------------------ (7) SEPT entry tdh_mem_sept_add tdh_mem_sept_remove tdh_mem_page_aug tdh_mem_page_remove tdh_mem_range_block tdh_mem_range_unblock tdg_mem_page_accept Current KVM interested SEAMCALLs: ------------------------------------------------------------------------ SEAMCALL Lock Name Lock Type Resource tdh_mng_create sharex_hp_lock EXCLUSIVE TDR sharex_lock EXCLUSIVE KOT tdh_mng_add_cx sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock EXCLUSIVE page to add tdh_mng_init sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS tdh_mng_vpflushdone sharex_hp_lock EXCLUSIVE TDR sharex_lock EXCLUSIVE KOT tdh_mng_key_config sharex_hp_lock EXCLUSIVE TDR tdh_mng_key_freeid sharex_hp_lock EXCLUSIVE TDR sharex_lock EXCLUSIVE KOT tdh_mng_rdwr sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS tdh_mr_extend sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS tdh_mr_finalize sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS tdh_vp_create sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_hp_lock EXCLUSIVE TDVPR tdh_vp_addcx sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_hp_lock EXCLUSIVE page to add tdh_vp_init sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS tdh_vp_init_apicid sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock SHARED TDCS tdh_vp_enter(*) sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS epoch_lock sharex_lock EXCLUSIVE TDCS secure_ept_lock tdh_vp_flush sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR tdh_vp_rd_wr sharex_hp_lock SHARED TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS tdh_mem_sept_add sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to add tdh_mem_sept_remove sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to remove tdh_mem_page_add sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS sharex_lock NO_LOCK TDCS secure_ept_lock sharex_hp_lock EXCLUSIVE page to add tdh_mem_page_aug sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to aug tdh_mem_page_remove sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to remove tdh_mem_range_block sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify tdh_mem_track sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS epoch_lock tdh_mem_range_unblock sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify tdh_phymem_page_reclaim sharex_hp_lock EXCLUSIVE page to reclaim sharex_hp_lock SHARED TDR tdh_phymem_cache_wb mutex_lock per package wbt_entries sharex_lock SHARED KOT tdh_phymem_page_wbinvd sharex_hp_lock SHARED page to be wbinvd Current KVM interested TDCALLs: ------------------------------------------------------------------------ tdg_mem_page_accept sept entry lock GUEST SEPT entry to modify TDCALLs like tdg_mr_rtmr_extend(), tdg_servtd_rd_wr(), tdg_mem_page_attr_wr() tdg_mem_page_attr_rd() are not included. *:(a) tdh_vp_enter() holds shared TDR lock and exclusive TDVPR lock, the two locks are released when exiting to VMM. (b) tdh_vp_enter() holds shared TDCS lock and shared TDCS epoch_lock lock, releases them before entering non-root mode. (c) tdh_vp_enter() holds shared epoch lock, contending with tdh_mem_track(). (d) tdh_vp_enter() only holds EXCLUSIVE secure_ept_lock when 0-stepping is suspected, i.e. when last_epf_gpa_list is not empty. When a EPT violation happens, TDX module checks if the guest RIP equals to the guest RIP of last TD entry. Only when this is true for 6 continuous times, the gpa will be recorded in last_epf_gpa_list. The list will be reset once guest RIP of a EPT violation and last TD enter RIP are unequal. === Summary === For the 8 kinds of common resources protected in TDX module: (1) TDR: There are only shared accesses to TDR during runtime (i.e. after TD is finalized and before TD tearing down), if we don't need to support calling tdh_vp_init_apicid() at runtime (e.g. for vCPU hotplug). tdh_vp_enter() holds shared TDR lock until exiting to VMM. TDCALLs do not acquire the TDR lock. (2) KOT (Key Ownership Table) Current KVM code should have avoided contention to this resource. (3) TDCS: Shared accessed or access with no lock when TDR is exclusively locked. Seamcalls in runtime (after TD finalized and before TD tearing down) do not contend with each other on TDCS. tdh_vp_enter() holds shared TDCS lock and releases it before entering non-root mode. Current TDCALLs for basic TDX do not acquire this lock. (4) TDVPR: Per-vCPU exclusive accessed except for tdh_vp_rd_wr(). tdh_vp_enter() holds exclusive TDVPR lock until exiting to VMM. TDCALLs do not acquire the TDVPR lock. (5) TDCS epoch: tdh_mem_track() requests exclusive access, and tdh_vp_enter() requests shared access. tdh_mem_track() can contend with tdh_vp_enter(). (6) SEPT tree: Protected by secure_ept_lock (sharex_lock). tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() holds shared lock; tdh_mem_sept_remove()/tdh_mem_range_block()/tdh_mem_range_unblock() holds exclusive lock. tdh_vp_enter() requests exclusive access when 0-stepping is suspected, contending with all other tdh_mem*(). Guest does not acquire this lock. So, kvm mmu_lock has prevented contentions between tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() and tdh_mem_sept_remove()/tdh_mem_range_block(). Though tdh_mem_sept_add()/tdh_mem_page_aug() races with tdh_vp_enter(), returning -EAGAIN directly is fine for them. The remaining issue is the contention between tdh_vp_enter() and tdh_mem_page_remove()/tdh_mem_sept_remove()/tdh_mem_range_block(). (7) SEPT entry: All exclusive access. tdg_mem_page_accept() may contend with other tdh_mem*() on a specific SEPT entry. (8) PAMT entry for target pages (e.g. page to add/aug/remove/reclaim/wbinvd): Though they are all exclusively locked, no race should be met as long as they belong to different pamt entries. Conclusion: Current KVM code should have avoided contentions of resources (1)-(4),(8), while (5),(6),(7) are still possible to meet contention. - tdh_mem_track() can contend with tdh_vp_enter() for (5) - tdh_vp_enter() contends with tdh_mem*() for (6) when 0-stepping is suspected. - tdg_mem_page_accept() can contend with other tdh_mem*() for (7).