This all started from an offhand comment wondering if we might be dropping A/D bits. Turns out, yes, we are dropping A/D bits (and W bits) in the TDP MMU. Attempting to prove that bug exposed two other notable goofs, the first of which has been lurking for a decade, give or take: Patch 1 - KVM treats _all_ MMU-writable SPTEs as volatile, even though KVM never clears WRITABLE outside of MMU lock. As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to update writable SPTEs. Patch 4 - KVM sends literally all EPT violations down the fast page fault handler. The bug was introduced when KVM started honoring EPTP12's A/D enabling, circa 2016. Neither fix seems to have an easily-measurable affect on performance; page faults are so slow that wasting even a few hundred cycles is dwarfed by the base cost. The rest of the series is cleanups, including a v2 of my RET_PF_CONTINUE, which would conflict in mildly annoying ways. The first two "DO NOT MERGE" patches are effectively a PoC showing that sending indirect MMUs down the fast page fault handler can work, it's just ineffective. I went down this rabbit hole after I saw a massive performance drop in the legacy MMU from patch 3 and thought it might be due to detecting spurious faults out of mmu_lock. The TDP MMU didn't exhibit the same problem, so it just _had_ to be write-contention on mmu_lock. Right!?!? Wrong. Turns out that spurious faults are a non-issue, but inverting the "ad_disabled" flag in the shadow page role and preventing reuse? Yep, that's a wee bit of an issue. I included the patches here to publicly document that this has been tried, it works, it's just likely not a win. The "DO NOT MERGE" selftest is my attempt to prove that the TDP MMU can drop W/A/D bits. It does work in the sense that it can trigger dropped bits, but it's fairly worthless as a selftest because it requires an explicit WARN in KVM :-/ Posted mostly as a reminder to myself that the primary MMU write-protects on writeback (this isn't the first time I've gone spelunking in that code...). E.g. it can trigger the WARN on WRITABLE and DIRTY bits this patch on top of the series: diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index f0af385c56e0..4f1ec56460ad 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -32,6 +32,8 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte) static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, u64 new_spte, int level) { + u64 current_spte; + /* * Atomically write the SPTE if it is a shadow-present, leaf SPTE with * volatile bits, i.e. has bits that can be set outside of mmu_lock. @@ -46,7 +48,13 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, */ if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) && spte_has_volatile_bits(old_spte)) - return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte); + ; + + current_spte = kvm_tdp_mmu_read_spte(sptep); + WARN_ONCE((old_spte ^ current_spte) & current_spte & + (PT_WRITABLE_MASK | PT_ACCESSED_MASK | PT_DIRTY_MASK), + "KVM: Dropped SPTE W/A/D bits: 0x%llx\n", + (old_spte ^ current_spte) & current_spte); __kvm_tdp_mmu_write_spte(sptep, new_spte); return old_spte; ------------[ cut here ]------------ KVM: Dropped SPTE W/A/D bits: 0x2 WARNING: CPU: 5 PID: 874 at arch/x86/kvm/mmu/tdp_iter.h:56 __tdp_mmu_set_spte+0x18b/0x1a0 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 5 PID: 874 Comm: volatile_spte_t Not tainted 5.18.0-rc1+ #830 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__tdp_mmu_set_spte+0x18b/0x1a0 [kvm] Call Trace: <TASK> tdp_mmu_zap_leafs+0xbc/0x180 [kvm] kvm_tdp_mmu_zap_leafs+0x74/0x90 [kvm] kvm_unmap_gfn_range+0xe8/0x100 [kvm] kvm_mmu_notifier_invalidate_range_start+0x11c/0x2c0 [kvm] __mmu_notifier_invalidate_range_start+0x7e/0x190 change_protection+0x657/0xb30 mprotect_fixup+0x1a1/0x310 do_mprotect_pkey+0x1f9/0x3a0 __x64_sys_mprotect+0x1b/0x20 do_syscall_64+0x31/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]--- Sean Christopherson (12): KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" KVM: x86/mmu: Make all page fault handlers internal to the MMU KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults KVM: x86/mmu: Expand and clean up page fault stats DO NOT MERGE: KVM: x86/mmu: Always send !PRESENT faults down the fast path DO NOT MERGE: KVM: x86/mmu: Use fast_page_fault() to detect spurious shadow MMU faults DO NOT MERGE: KVM: selftests: Attempt to detect lost dirty bits arch/x86/include/asm/kvm_host.h | 5 + arch/x86/kvm/mmu.h | 87 -------- arch/x86/kvm/mmu/mmu.c | 211 ++++++++++-------- arch/x86/kvm/mmu/mmu_internal.h | 123 +++++++++- arch/x86/kvm/mmu/mmutrace.h | 1 + arch/x86/kvm/mmu/paging_tmpl.h | 15 +- arch/x86/kvm/mmu/spte.c | 28 +++ arch/x86/kvm/mmu/spte.h | 15 +- arch/x86/kvm/mmu/tdp_iter.h | 34 ++- arch/x86/kvm/mmu/tdp_mmu.c | 92 +++++--- arch/x86/kvm/x86.c | 24 +- tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 4 + .../selftests/kvm/volatile_spte_test.c | 208 +++++++++++++++++ 14 files changed, 600 insertions(+), 248 deletions(-) create mode 100644 tools/testing/selftests/kvm/volatile_spte_test.c base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3 -- 2.36.0.rc2.479.g8af0fa9b8e-goog