https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #2 from leiyang@xxxxxxxxxx ---
(In reply to Sean Christopherson from comment #1)
> On Wed, Dec 11, 2024, bugzilla-daemon@xxxxxxxxxx wrote:
> > I hit a bug on the intel host, this problem occurs randomly:
> > [ 406.127925] ------------[ cut here ]------------
> > [ 406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001
> > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
>
> Can you describe the host activity at the time of the WARN? E.g. is it under
> memory pressure and potentially swapping, is KSM or NUMA balancing active? I
> have a sound theory for how the scenario occurs on KVM's end, but I still think
> it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
> situation.

Sorry for the late reply; I spent some time investigating this problem. The
host dmesg printed this error message while an automated test was installing a
new guest. I found that the bug reproduces when this install case is run on the
machine's first run. To avoid ambiguity:

1. The test must be run the first time the machine runs this kernel. That
   means that if I hit the problem and then reboot the host, it cannot be
   reproduced again, even if I run the same tests.
2. Given 1, I must also run this guest installation case; the problem cannot
   be reproduced with other cases.

By comparison with the other cases, the only difference is that this
installation case uses a PXE install based on an internal KS cfg.

Yes, I think the host is running under memory pressure and swapping. According
to the automation log, KSM is disabled, and I don't add NUMA options to the
qemu command line. If you have a machine that can clone avocado and run tp-qemu
tests, you can prepare the environment and then run this case:
unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads

> And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> memory exposed to it? E.g. an assigned device? If so, can you provide the
> register state from the other WARNs?
> If the PFNs are all in the same range, then maybe
> this is something funky with the VM_PFNMAP | VM_IO path.

I can confirm it has VM_IO, because it is running an installation case and the
VM is constantly performing I/O operations. This is the memory and CPU
configuration my tests used; I hope it helps you debug this problem:

    -m 29G \
    -smp 48,maxcpus=48,cores=24,threads=1,dies=1,sockets=2 \

It looks like there are no other devices and VM_PFNMAP is not in use. Please
correct me if I'm wrong.

> The WARN is a sanity check I added because it should be impossible for KVM to
> install a non-writable SPTE overtop an existing writable SPTE. Or so I thought.
> The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
> correctly handles the unexpected change, the WARN is there purely to flag that
> something unexpected happen.
>
>         if (new_spte == iter->old_spte)
>                 ret = RET_PF_SPURIOUS;
>         else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>                 return RET_PF_RETRY;
>         else if (is_shadow_present_pte(iter->old_spte) &&
>                  (!is_last_spte(iter->old_spte, iter->level) ||
>                   WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte, new_spte))))  <====
>                 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
>
> Cross referencing the register state
>
>   RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
>   RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
>   RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
>   R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
>   R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000
>
> with the disassembly
>
>   4885C8          TEST RAX,RCX
>   0F84EEFEFFFF    JE 0000000000000-F1
>   4985C8          TEST R8,RCX
>   0F85E5FEFFFF    JNE 0000000000000-F1
>   0F0B            UD2
>
> RAX is the old SPTE and RCX is the new SPTE, i.e.
> the SPTE change is:
>
>   860000025e000bf7
>   800000025e0009f5
>
> On Intel, bits 57 and 58 are the host-writable and MMU-writable flags
>
>   #define EPT_SPTE_HOST_WRITABLE        BIT_ULL(57)
>   #define EPT_SPTE_MMU_WRITABLE         BIT_ULL(58)
>
> which means KVM is overwriting a writable SPTE with a non-writable SPTE because
> the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b) retrieved
> a non-writable PFN from the primary MMU, and that fault raced with a WRITE fault
> on a different vCPU that retrieved and installed a writable PFN.
>
> On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a writable PFN.
> Given that KVM has an valid writable SPTE, the corresponding PTE in the primary MMU
> *must* be writable, otherwise there's a missing mmu_notifier invalidation.
>
>         /* map read fault as writable if possible */
>         if (!(flags & FOLL_WRITE) && kfp->map_writable &&
>             get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
>                 put_page(page);
>                 page = wpage;
>                 flags |= FOLL_WRITE;
>         }
>
>   out:
>         *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
>         return npages;
>
> Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast GUP,
> but they should be mutually exclusive with KVM having a writable SPTE. If the
> mapping is truncated or the folio is swapped out, secondary MMUs need to be
> invalidated before folio->mapping is nullified.
>
>         /*
>          * The mapping may have been truncated, in any case we cannot determine
>          * if this mapping is safe - fall back to slow path to determine how to
>          * proceed.
>          */
>         if (!mapping)
>                 return false;
>
> And secretmem can't be GUP'd, and it's not a long-term pin, so these checks don't
> apply either:
>
>         if (check_secretmem && secretmem_mapping(mapping))
>                 return false;
>         /* The only remaining allowed file system is shmem. */
>         return !reject_file_backed || shmem_mapping(mapping);
>
> Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is writable
> in the primary MMU, regardless of the fault type.
>
> If this turns out to get a legitimate scenario, then I think it makes sense to
> add an is_access_allowed() check and treat the fault as spurious. But I would
> like to try to bottom out on what exactly is happening, because I'm mildly
> concerned something is buggy in the primary MMU.

If you need me to provide more info, please feel free to let me know. If you
send a patch to fix this problem, I can help verify it, since I think I have
found a stable reproducer.
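As a cross-check of the SPTE comparison quoted above, here is a minimal
standalone sketch that decodes the two values from the register dump. The bit
positions 57 and 58 come from the quoted EPT_SPTE_* defines. The userspace
BIT_ULL macro and the helper names (spte_host_writable, spte_mmu_writable,
spte_lost_mmu_writable, print_spte) are hypothetical stand-ins for
illustration, not KVM's actual helpers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's BIT_ULL() macro (assumption). */
#define BIT_ULL(n) (1ULL << (n))

/* Bit positions taken from the defines quoted in comment #1. */
#define EPT_SPTE_HOST_WRITABLE BIT_ULL(57)
#define EPT_SPTE_MMU_WRITABLE  BIT_ULL(58) /* equals 0400000000000000 */

/* Hypothetical helper: is the host-writable software bit set? */
static inline bool spte_host_writable(uint64_t spte)
{
	return (spte & EPT_SPTE_HOST_WRITABLE) != 0;
}

/* Hypothetical helper: is the MMU-writable software bit set? */
static inline bool spte_mmu_writable(uint64_t spte)
{
	return (spte & EPT_SPTE_MMU_WRITABLE) != 0;
}

/*
 * Hypothetical helper: true when the old SPTE is MMU-writable but the
 * new one is not. This is the combination the TEST/JE, TEST/JNE pair in
 * the quoted disassembly falls through on before reaching UD2.
 */
static inline bool spte_lost_mmu_writable(uint64_t old_spte, uint64_t new_spte)
{
	return spte_mmu_writable(old_spte) && !spte_mmu_writable(new_spte);
}

/* Print one SPTE with its two software-writable flags. */
static inline void print_spte(const char *label, uint64_t spte)
{
	printf("%s %016llx host-writable=%d mmu-writable=%d\n",
	       label, (unsigned long long)spte,
	       spte_host_writable(spte), spte_mmu_writable(spte));
}
```

Calling print_spte() on 0x860000025e000bf7 reports both flags set, and on
0x800000025e0009f5 reports both clear, so spte_lost_mmu_writable() returns
true for this pair of values.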