[Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #2 from leiyang@xxxxxxxxxx ---
(In reply to Sean Christopherson from comment #1)
> On Wed, Dec 11, 2024, bugzilla-daemon@xxxxxxxxxx wrote:
> > I hit a bug on the intel host, this problem occurs randomly:
> > [  406.127925] ------------[ cut here ]------------
> > [  406.132572] WARNING: CPU: 52 PID: 12253 at
> arch/x86/kvm/mmu/tdp_mmu.c:1001
> > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
> 
> Can you describe the host activity at the time of the WARN?  E.g. is it under
> memory pressure and potentially swapping, is KSM or NUMA balancing active? I
> have a sound theory for how the scenario occurs on KVM's end, but I still
> think
> it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
> situation.

I spent some time for this problem so late reply. When host dmesg print this
error messages which running install a new guest via automation. And I found
this bug's reproducer is run this install case after the mchine first time
running(Let me introduce more to avoid ambiguity: 1. Must to test it when the
machine first time running this kernel,that mean's if I hit this problem then
reboot host, it can not reproduced again even if I run the same tests.  2.
Based on 1, I also must test this installation guest case, it can not
reporduced on other cases.). But through compare, this installation cases only
used pxe install based on a internal KS cfg is different other cases.

Sure, I think it's running under memory pressure and swapping. Based on
automation log, KSM is disable and I don't add NUMA in qemu command line. 

If you have a machine can clone avocado and run tp-qemu tests, you can prepare
env then run this case:
unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads

> 
> And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> memory exposed to it?  E.g. an assigned device?  If so, can you provide the
> register
> state from the other WARNs?  If the PFNs are all in the same range, then
> maybe
> this is something funky with the VM_PFNMAP | VM_IO path.

I can confirm it has VM_IO because it runing installation case, VM is
constantly performing I/O operations

This my tests used memory and CPU configured, hope it help you debug this
problem:
-m 29G \
-smp 48,maxcpus=48,cores=24,threads=1,dies=1,sockets=2  \

And looks like there are no other device  and no using VM_PFNMAP. Please
correct me if I'm wrong.

> 
> The WARN is a sanity check I added because it should be impossible for KVM to
> install a non-writable SPTE overtop an existing writable SPTE.  Or so I
> thought.
> The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
> correctly handles the unexpected change, the WARN is there purely to flag
> that
> something unexpected happen.
> 
>       if (new_spte == iter->old_spte)
>               ret = RET_PF_SPURIOUS;
>       else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>               return RET_PF_RETRY;
>       else if (is_shadow_present_pte(iter->old_spte) &&
>                (!is_last_spte(iter->old_spte, iter->level) ||
>                 WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte,
> new_spte)))) <====
>               kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> 
> Cross referencing the register state
> 
>   RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
>   RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
>   RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
>   R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
>   R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000
> 
> with the disassembly
> 
>   4885C8                          TEST RAX,RCX
>   0F84EEFEFFFF                    JE 0000000000000-F1
>   4985C8                          TEST R8,RCX
>   0F85E5FEFFFF                    JNE 0000000000000-F1
>   0F0B                            UD2
>   
> RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is:
> 
>   860000025e000bf7
>   800000025e0009f5
> 
> On Intel, bits 57 and 58 are the host-writable and MMU-writable flags
> 
>   #define EPT_SPTE_HOST_WRITABLE      BIT_ULL(57)
>   #define EPT_SPTE_MMU_WRITABLE               BIT_ULL(58)
> 
> which means KVM is overwriting a writable SPTE with a non-writable SPTE
> because
> the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b)
> retrieved
> a non-writable PFN from the primary MMU, and that fault raced with a WRITE
> fault
> on a different vCPU that retrieved and installed a writable PFN.
> 
> On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a
> writable PFN.
> Given that KVM has an valid writable SPTE, the corresponding PTE in the
> primary MMU
> *must* be writable, otherwise there's a missing mmu_notifier invalidation.
> 
>       /* map read fault as writable if possible */
>       if (!(flags & FOLL_WRITE) && kfp->map_writable &&
>           get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
>               put_page(page);
>               page = wpage;
>               flags |= FOLL_WRITE;
>       }
> 
> out:
>       *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
>       return npages;
> 
> Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast
> GUP,
> but they should be mutually exclusive with KVM having a writable SPTE.  If
> the
> mapping is truncated or the folio is swapped out, secondary MMUs need to be
> invalidated before folio->mapping is nullified.
> 
>       /*
>        * The mapping may have been truncated, in any case we cannot determine
>        * if this mapping is safe - fall back to slow path to determine how to
>        * proceed.
>        */
>       if (!mapping)
>               return false;
> 
> And secretmem can't be GUP'd, and it's not a long-term pin, so these checks
> don't
> apply either:
> 
>       if (check_secretmem && secretmem_mapping(mapping))
>               return false;
>       /* The only remaining allowed file system is shmem. */
>       return !reject_file_backed || shmem_mapping(mapping);
> 
> Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is
> writable
> in the primary MMU, regardless of the fault type.
> 
> If this turns out to get a legitimate scenario, then I think it makes sense
> to
> add an is_access_allowed() check and treat the fault as spurious.  But I
> would
> like to try to bottom out on what exactly is happening, because I'm mildly
> concerned something is buggy in the primary MMU.

If you need me to provide more info, please feel free to let me know. And if
you  sent a patch to fix this problem I can help to verified it, since I think
I found the stable reproducer.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux