https://bugzilla.kernel.org/show_bug.cgi?id=219588 --- Comment #1 from Sean Christopherson (seanjc@xxxxxxxxxx) --- On Wed, Dec 11, 2024, bugzilla-daemon@xxxxxxxxxx wrote: > I hit a bug on the intel host, this problem occurs randomly: > [ 406.127925] ------------[ cut here ]------------ > [ 406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] Can you describe the host activity at the time of the WARN? E.g. is it under memory pressure and potentially swapping, is KSM or NUMA balancing active? I have a sound theory for how the scenario occurs on KVM's end, but I still think it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this situation. And does the VM have device memory or any other type of VM_PFNMAP or VM_IO memory exposed to it? E.g. an assigned device? If so, can you provide the register state from the other WARNs? If the PFNs are all in the same range, then maybe this is something funky with the VM_PFNMAP | VM_IO path. The WARN is a sanity check I added because it should be impossible for KVM to install a non-writable SPTE overtop an existing writable SPTE. Or so I thought. The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM correctly handles the unexpected change, the WARN is there purely to flag that something unexpected happen. if (new_spte == iter->old_spte) ret = RET_PF_SPURIOUS; else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) return RET_PF_RETRY; else if (is_shadow_present_pte(iter->old_spte) && (!is_last_spte(iter->old_spte, iter->level) || WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte, new_spte)))) <==== kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level); Cross referencing the register state RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015 RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002 R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000 R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000 with the disassembly 4885C8 TEST RAX,RCX 0F84EEFEFFFF JE 0000000000000-F1 4985C8 TEST R8,RCX 0F85E5FEFFFF JNE 0000000000000-F1 0F0B UD2 RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is: 860000025e000bf7 800000025e0009f5 On Intel, bits 57 and 58 are the host-writable and MMU-writable flags #define EPT_SPTE_HOST_WRITABLE BIT_ULL(57) #define EPT_SPTE_MMU_WRITABLE BIT_ULL(58) which means KVM is overwriting a writable SPTE with a non-writable SPTE because the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b) retrieved a non-writable PFN from the primary MMU, and that fault raced with a WRITE fault on a different vCPU that retrieved and installed a writable PFN. On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a writable PFN. Given that KVM has an valid writable SPTE, the corresponding PTE in the primary MMU *must* be writable, otherwise there's a missing mmu_notifier invalidation. /* map read fault as writable if possible */ if (!(flags & FOLL_WRITE) && kfp->map_writable && get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) { put_page(page); page = wpage; flags |= FOLL_WRITE; } out: *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE); return npages; Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast GUP, but they should be mutually exclusive with KVM having a writable SPTE. If the mapping is truncated or the folio is swapped out, secondary MMUs need to be invalidated before folio->mapping is nullified. /* * The mapping may have been truncated, in any case we cannot determine * if this mapping is safe - fall back to slow path to determine how to * proceed. */ if (!mapping) return false; And secretmem can't be GUP'd, and it's not a long-term pin, so these checks don't apply either: if (check_secretmem && secretmem_mapping(mapping)) return false; /* The only remaining allowed file system is shmem. */ return !reject_file_backed || shmem_mapping(mapping); Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is writable in the primary MMU, regardless of the fault type. If this turns out to get a legitimate scenario, then I think it makes sense to add an is_access_allowed() check and treat the fault as spurious. But I would like to try to bottom out on what exactly is happening, because I'm mildly concerned something is buggy in the primary MMU. -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug.