On 12/9/2024 9:07 AM, Binbin Wu wrote:
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
from host VMM.
Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
For TDX guests, APICv is always enabled by TDX module. But in current TDX basic support patch series, TDX code inhibits APICv for TDX guests from the view of KVM. Synced with Isaku, the reason was to prevent the APICv active state from toggling during runtime. Sean raised the concern in a PUCK session that it is not concept right to "lie" to KVM that APICv is disabled while it is actually enabled. Instead, it's better to make APICv enabled and prevent it from being disabled from the view of KVM. Following is the analysis about the APICv active state for TDX to kick off further discussions. APICv active state ================== From the view of KVM, whether APICv state is active or not is decided by: 1. APIC is hw enabled 2. VM and vCPU have no inhibit reasons set. APIC hw enabled --------------- After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}. - KVM_SET_LAPIC Currently, KVM allows userspace to
request KVM_SET_LAPIC to set the state of LAPIC for TDX guests. There are two options: - Force x2APIC mode and default base address when userspace request KVM_SET_LAPIC. - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected is true), since migration is not supported yet. Choose option 2 for simplicity for now. - KVM_SET_{SREGS, SREGS2} KVM rejects userspace to set APIC base when vcpu->kvm->arch.has_protected_state and vcpu->arch.guest_state_protected are both set. Currently for TDX, kvm->arch.has_protected_state is not set, so userspace is allowed to modify APIC base. There are three options: - Reject KVM_SET_{SREGS, SREGS2} when either vcpu->arch.guest_state_protected or vcpu->kvm->arch.has_protected_state is set. - Check vcpu->arch.guest_state_protected before kvm_apic_set_base() in __set_sregs_common(). - Set has_protected_state for TDX guests. Choose option 3, i.e. to set has_protected_state for TDX guests, aligning with SEV/SNP. APICv inhibit reasons
--------------------- APICv could be disabled due to a few inhibit reasons. - APICV_INHIBIT_REASON_DISABLED For TDX, this could be triggered when the module parameter enable_apicv is set to false. enable_apicv could be checked in tdx_bringup(). Disable TDX support if !enable_apicv. So that APICV_INHIBIT_REASON_DISABLED will not be set during runtime and apic->apicv_active is initialized to true. - APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED KVM will reject userspace to modify APIC base, i.e., APIC mode will always be x2APIC mode, the only reason this could be set is it fails to allocate memory for KVM apic map. - APICV_INHIBIT_REASON_PIT_REINJ Based on current code, this is relevant only to AMD's AVIC, so this reason will not be set for TDX guests. However, KVM is also not be able to intercept EOI for TDX guests. For TDX, if in-kernel PIT is enabled and in re-inject mode, the use of PIT in guest may have problem. Fortunately, modern OSes don't use PIT. Options: - Enforce irqchip
split for TDX guests, i.e. in-kernel PIT is not supported. - Leave it as it is and expect PIT will not be used. - Reasons will not be set for TDX - APICV_INHIBIT_REASON_HYPERV TDX doesn't support HyperV guest yet. - APICV_INHIBIT_REASON_ABSENT In-kernel LAPIC is checked in tdx_vcpu_create(). - APICV_INHIBIT_REASON_BLOCKIRQ TDX doesn't support KVM_SET_GUEST_DEBUG. - APICV_INHIBIT_REASON_APIC_ID_MODIFIED KVM will reject userspace to modify APIC base, i.e., APIC mode will always be x2APIC mode. - APICV_INHIBIT_REASON_APIC_BASE_MODIFIED KVM will reject userspace to set APIC base. - Reasons relevant only to AMD's AVIC - APICV_INHIBIT_REASON_NESTED, - APICV_INHIBIT_REASON_IRQWIN, - APICV_INHIBIT_REASON_SEV, - APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED. Summary about APICv inhibit reasons: APICv could still be disabled runtime in some corner case, e.g, APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure. After checking enable_apicv in tdx_bringup(),
apic->apicv_active is initialized as true in kvm_create_lapic(). If APICv is inhibited due to any reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check if APICv is disabled for TDX, if APICv is disabled, bug the VM. Changes of APICv active from false to true ========================================== Lazy check for pending APIC EOI when In-kernel IOAPIC ----------------------------------------------------- In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor accelerates write to APIC EOI register and does not trap if the interrupt is edge-triggered. So there is a workaround by lazy check for pending APIC EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no pending APIC EOI. KVM is also not be able to intercept EOI for TDX guests. - When APICv is enabled The code of lazy check for pending APIC EOI doesn't work for TDX because KVM can't get the status of real IRR and ISR, and the values are 0s in vIRR and vISR
in apic->regs[], kvm_apic_pending_eoi() will always return false. So the RTC pending EOI will always be cleared when ioapic_set_irq() is called for RTC. Then userspace may miss the coalesced RTC interrupts. - When When APICv is disabled ioapic_lazy_update_eoi() will not be called,then pending EOI status for RTC will not be cleared after setting and this will mislead userspace to see coalesced RTC interrupts. Options: - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC. - Leave it as it is, but the use of RTC may not be accurate. kvm_can_post_timer_interrupt() ------------------------------ Whether housekeeping CPU can deliver timer interrupt to target vCPU via posted interrupt when nohz_full option set. - When APICv active is false, it always return false. - When APICv active is true, it also depends on whether mwait or hlt in guest is set. For TDX guests, hlt will trigger #VE unconditionally and TDX guests request HLT via TDVMCALL. Whether mwait is
allowed depends on the cpuid configuration in TD_PARAMS. So current implementation of kvm_mwait_in_guest() and kvm_hlt_in_guest() doesn't reflect the real status for TDX guests. However, Sean mentioned "consulting kvm_can_post_timer_interrupt() in the expiration path is silly". There could be cleanups for this part. https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@xxxxxxxxxx/ So, don't do any TDX-specific logic for it. apic_timer_expired() -------------------- About kvm_can_post_timer_interrupt() in the expiration path, see the description above. For the rest part, when the function is not called from timer function - If apicv_active, the timer interrupt will be injected via kvm_apic_inject_pending_timer_irqs(). - If !apicv_active, the timer interrupt will be handled via lapic_timer.pending approach, and finally, the timer interrupt is also be injected via kvm_apic_inject_pending_timer_irqs(). Basically, they are functionally equivalent with subtle differences. E.g., if an
hrtimer fires while KVM is handling a write to TMICT, KVM will deliver the interrupt if configured to post timer, but not if APICv is disabled, because the latter will increment "pending", and "pending" will be cleared before handling the new TMICT. Ditto for switch APIC timer modes. Sean mentioned the entire lapic_timer.pending approach may need to be ditched, and the timer interrupt could be directly delivered no matter apicv is active or not. https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@xxxxxxxxxx/ This is not TDX specific, leave it for now. Options: - Fix kvm_mwait_in_guest()/kvm_hlt_in_guest() for TDX guests. - VMX preemption timer can't be used by TDX guests anyway, leave kvm_mwait_in_guest()/kvm_hlt_in_guest() as them are, posted timer interrupt could be used when userspace requested to disable exit for mwait/hlt. - VMX preemption timer can't be used by TDX guests anyway, skip checking kvm_mwait_in_guest()/kvm_hlt_in_guest(). kvm_arch_dy_has_pending_interrupt()
----------------------------------- Before enabling off-TD debug, there is no functional change because there is no PAUSE Exit for TDX guests. After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true to get the pending interrupt from PID. Set APICv to active for TDX is the right thing to do. update_cr8_intercept() ---------------------- Functionally unchanged because the callback update_cr8_intercept() for TDX is ignored. Set APICv to active for TDX can return earlier to skip unnecessary code. kvm_lapic_reset() kvm_apic_set_state() -------------------- The callbacks apicv_post_state_restore(), hwapic_irr_update(), and hwapic_isr_update() will be called for TDX guests when apicv is active, these callbacks have been ignored by TDX code already, no functional changes. Issues ====== PIC interrupts -------------- KVM inject PIC interrupt via event injection path. Currently, TDX code doesn't handle this, thus PIC interrupts will be lost. Fortunately, modern OSes
don't use PIC. We could use posted-interrupt in to deliver PIC interrupt if needed. Or can we assume PIC will not be used by TDX guests? In-kernel PIT in re-inject mode ------------------------------- See the description for "APICV_INHIBIT_REASON_PIT_REINJ" above. Lazy check for pending APIC EOI of In-kernel IOAPIC --------------------------------------------------- See the description for the same item in "Changes of APICv active from false to true". Open: For the issues related to in-kernel PIT and in-kernel IOAPIC, should KVM force irqchip split for TDX guests to eliminate the use of in-kernel PIT and in-kernel IOAPIC? Proposed code change ==================== Below is the proposed code change to change APICv active from false to true for TDX guests. Force irqchip split for TEX guests is not included. Note, by rejecting KVM_GET_LAPIC/KVM_SET_LAPIC for TDX guests (i.e., when guest_apic_protected), it returns an error code instead of returning 0. It requires modifications in
QEMU TDX support code to avoid requesting KVM_GET_LAPIC/KVM_SET_LAPIC. 8<---------------------------------------------------------------------------- diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 0787855ab006..97025a240d54 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1289,15 +1289,6 @@ enum kvm_apicv_inhibit { */ APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED, - /*********************************************************/ - /* INHIBITs that are relevant only to the Intel's APICv. */ - /*********************************************************/ - - /* - * APICv is disabled because TDX doesn't support it. - */ - APICV_INHIBIT_REASON_TDX, - NR_APICV_INHIBIT_REASONS, }; @@ -1316,8 +1307,7 @@ enum kvm_apicv_inhibit { __APICV_INHIBIT_REASON(IRQWIN), \ __APICV_INHIBIT_REASON(PIT_REINJ), \ __APICV_INHIBIT_REASON(SEV), \ - __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \ - __APICV_INHIBIT_REASON(TDX) +
__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED) struct kvm_arch { unsigned long n_used_mmu_pages; diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index 9b79b4bb063f..df9cc4a7f2d8 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -782,8 +782,10 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) { - if (WARN_ON_ONCE(is_td_vcpu(vcpu))) + if (is_td_vcpu(vcpu)) { + KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); return; + } vmx_refresh_apicv_exec_ctrl(vcpu); } @@ -908,8 +910,7 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \ BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \ BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \ - BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \ - BIT(APICV_INHIBIT_REASON_TDX)) + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED)) struct kvm_x86_ops vt_x86_ops __initdata = { .name =
KBUILD_MODNAME, diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 67fc391fe798..cc516ab2d990 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -614,6 +614,7 @@ int tdx_vm_init(struct kvm *kvm) struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); kvm->arch.has_private_mem = true; + kvm->arch.has_protected_state = true; /* * Because guest TD is protected, VMM can't parse the instruction in TD. @@ -2354,8 +2355,6 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params, goto teardown; } - kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX); - return 0; /* @@ -2741,7 +2740,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx) return -EIO; } - vcpu->arch.apic->apicv_active = false; vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; return 0; @@ -3273,6 +3271,11 @@ int __init tdx_bringup(void) goto success_disable_tdx; } + if (!enable_apicv) { + pr_err("APICv is required for TDX\n"); + goto success_disable_tdx; + } + if
(!tdp_mmu_enabled || !enable_mmio_caching) { pr_err("TDP MMU and MMIO caching is required for TDX\n"); goto success_disable_tdx; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e433c8ee63a5..837a287d8c47 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5108,6 +5108,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { + if (vcpu->arch.apic->guest_apic_protected) + return -EINVAL; + kvm_x86_call(sync_pir_to_irr)(vcpu); return kvm_apic_get_state(vcpu, s); @@ -5118,6 +5121,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, { int r; + if (vcpu->arch.apic->guest_apic_protected) + return -EINVAL; + r = kvm_apic_set_state(vcpu, s); if (r) return r;