[Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]

bugzilla-daemon@xxxxxxxxxx · Mon, 16 Dec 2024 05:42:30 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #2 from leiyang@xxxxxxxxxx ---
(In reply to Sean Christopherson from comment #1)
> On Wed, Dec 11, 2024, bugzilla-daemon@xxxxxxxxxx wrote:
> > I hit a bug on the intel host, this problem occurs randomly:
> > [  406.127925] ------------[ cut here ]------------
> > [  406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
> 
> Can you describe the host activity at the time of the WARN?  E.g. is it under
> memory pressure and potentially swapping, is KSM or NUMA balancing active? I
> have a sound theory for how the scenario occurs on KVM's end, but I still
> think it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE
> in this situation.

Sorry for the late reply; I spent some time on this problem. The host dmesg
printed this error message while automation was installing a new guest. I
found that the reproducer is to run this install case the first time the
machine runs this kernel. To avoid ambiguity:
1. It must be tested the first time the machine runs this kernel. If I hit the
problem and then reboot the host, it cannot be reproduced again, even if I run
the same tests.
2. Given 1, I must also run this guest installation case; it cannot be
reproduced with other cases.
Comparing the cases, the only difference is that this installation case uses a
PXE install based on an internal KS cfg.

Yes, I think it was running under memory pressure and swapping. According to
the automation log, KSM is disabled, and I did not add any NUMA options to the
QEMU command line.
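
The host-side state asked about in comment #1 can also be read directly from
the host rather than inferred from the automation log. The following is only a
minimal sketch, assuming the usual Linux paths /sys/kernel/mm/ksm/run,
/proc/sys/kernel/numa_balancing and /proc/vmstat are present; the helper names
(read_int_file, read_vmstat, report_host_state) are made up for illustration
and are not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read a single integer from a sysfs/procfs file.
 * Returns 0 and stores the value in *out on success, or -1 if the file
 * is missing or unreadable (e.g. the feature is not built into the kernel).
 */
int read_int_file(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int matched;

	if (!f)
		return -1;
	matched = fscanf(f, "%ld", out);
	fclose(f);
	return matched == 1 ? 0 : -1;
}

/*
 * Look up one "name value" counter in /proc/vmstat.
 * Returns 0 and stores the counter in *out, or -1 if it is not found.
 */
int read_vmstat(const char *key, unsigned long long *out)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val;
	int ret = -1;

	if (!f)
		return -1;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (strcmp(name, key) == 0) {
			*out = val;
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* Print KSM, NUMA balancing and swap activity, noting anything unavailable. */
void report_host_state(void)
{
	long v;
	unsigned long long c;

	if (read_int_file("/sys/kernel/mm/ksm/run", &v) == 0)
		printf("KSM run: %ld\n", v);
	else
		printf("KSM run: unavailable\n");

	if (read_int_file("/proc/sys/kernel/numa_balancing", &v) == 0)
		printf("NUMA balancing: %ld\n", v);
	else
		printf("NUMA balancing: unavailable\n");

	if (read_vmstat("pswpin", &c) == 0)
		printf("pages swapped in since boot: %llu\n", c);
	if (read_vmstat("pswpout", &c) == 0)
		printf("pages swapped out since boot: %llu\n", c);
}
```

Calling report_host_state() before and after the install case would show
whether the swap counters moved during the run. Note that the NUMA balancing
knob is a host kernel setting and is independent of any NUMA options on the
QEMU command line.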

If you have a machine that can clone avocado and run tp-qemu tests, you can
prepare the environment and then run this case:
unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads

> 
> And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> memory exposed to it?  E.g. an assigned device?  If so, can you provide the
> register state from the other WARNs?  If the PFNs are all in the same range,
> then maybe this is something funky with the VM_PFNMAP | VM_IO path.

I can confirm it has VM_IO because it is running an installation case; the VM
is constantly performing I/O operations.

These are the memory and CPU settings my tests used; I hope they help you
debug this problem:
-m 29G \
-smp 48,maxcpus=48,cores=24,threads=1,dies=1,sockets=2  \

And it looks like there are no other devices and VM_PFNMAP is not in use.
Please correct me if I'm wrong.

> 
> The WARN is a sanity check I added because it should be impossible for KVM to
> install a non-writable SPTE overtop an existing writable SPTE.  Or so I
> thought.  The WARN is benign in the sense that nothing bad will happen _in
> KVM_; KVM correctly handles the unexpected change, the WARN is there purely
> to flag that something unexpected happened.
> 
>       if (new_spte == iter->old_spte)
>               ret = RET_PF_SPURIOUS;
>       else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>               return RET_PF_RETRY;
>       else if (is_shadow_present_pte(iter->old_spte) &&
>                (!is_last_spte(iter->old_spte, iter->level) ||
>                 WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte, new_spte)))) <====
>               kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> 
> Cross referencing the register state
> 
>   RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
>   RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
>   RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
>   R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
>   R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000
> 
> with the disassembly
> 
>   4885C8                          TEST RAX,RCX
>   0F84EEFEFFFF                    JE 0000000000000-F1
>   4985C8                          TEST R8,RCX
>   0F85E5FEFFFF                    JNE 0000000000000-F1
>   0F0B                            UD2
>   
> RAX is the old SPTE and R8 is the new SPTE (RCX holds the mask being
> tested), i.e. the SPTE change is:
> 
>   860000025e000bf7
>   800000025e0009f5
> 
> On Intel, bits 57 and 58 are the host-writable and MMU-writable flags
> 
>   #define EPT_SPTE_HOST_WRITABLE      BIT_ULL(57)
>   #define EPT_SPTE_MMU_WRITABLE               BIT_ULL(58)
> 
> which means KVM is overwriting a writable SPTE with a non-writable SPTE
> because the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE
> and (b) retrieved a non-writable PFN from the primary MMU, and that fault
> raced with a WRITE fault on a different vCPU that retrieved and installed a
> writable PFN.
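
To make the bit comparison above concrete, the two SPTE values can be diffed
mechanically. This is only a small userspace sketch: the two values are taken
from the register dump (RAX and R8), the bit 57/58 definitions are the ones
quoted above, and the helper names (spte_changed_bits, spte_is_mmu_writable,
spte_is_host_writable) are made up for illustration and are not KVM functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n)              (1ULL << (n))

/* EPT write permission bit (hardware-defined). */
#define EPT_WRITE               BIT_ULL(1)
/* KVM software-available bits, as quoted in the mail. */
#define EPT_SPTE_HOST_WRITABLE  BIT_ULL(57)
#define EPT_SPTE_MMU_WRITABLE   BIT_ULL(58)

/* Values from the register dump: RAX (old SPTE) and R8 (new SPTE). */
#define OLD_SPTE 0x860000025e000bf7ULL
#define NEW_SPTE 0x800000025e0009f5ULL

/* Bits that differ between the old and the new SPTE. */
uint64_t spte_changed_bits(uint64_t old_spte, uint64_t new_spte)
{
	return old_spte ^ new_spte;
}

/* True if KVM's MMU-writable software bit is set. */
bool spte_is_mmu_writable(uint64_t spte)
{
	return (spte & EPT_SPTE_MMU_WRITABLE) != 0;
}

/* True if KVM's host-writable software bit is set. */
bool spte_is_host_writable(uint64_t spte)
{
	return (spte & EPT_SPTE_HOST_WRITABLE) != 0;
}
```

With these values, spte_changed_bits(OLD_SPTE, NEW_SPTE) is
0x0600000000000202: bits 57 and 58 are cleared, along with bit 1 (the EPT
write permission) and bit 9. The PFN bits are identical in both values.
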
> 
> On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a writable
> PFN.  Given that KVM has a valid writable SPTE, the corresponding PTE in the
> primary MMU *must* be writable, otherwise there's a missing mmu_notifier
> invalidation.
> 
>       /* map read fault as writable if possible */
>       if (!(flags & FOLL_WRITE) && kfp->map_writable &&
>           get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
>               put_page(page);
>               page = wpage;
>               flags |= FOLL_WRITE;
>       }
> 
> out:
>       *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
>       return npages;
> 
> Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast
> GUP, but they should be mutually exclusive with KVM having a writable SPTE.
> If the mapping is truncated or the folio is swapped out, secondary MMUs need
> to be invalidated before folio->mapping is nullified.
> 
>       /*
>        * The mapping may have been truncated, in any case we cannot determine
>        * if this mapping is safe - fall back to slow path to determine how to
>        * proceed.
>        */
>       if (!mapping)
>               return false;
> 
> And secretmem can't be GUP'd, and it's not a long-term pin, so these checks
> don't apply either:
> 
>       if (check_secretmem && secretmem_mapping(mapping))
>               return false;
>       /* The only remaining allowed file system is shmem. */
>       return !reject_file_backed || shmem_mapping(mapping);
> 
> Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is
> writable in the primary MMU, regardless of the fault type.
> 
> If this turns out to be a legitimate scenario, then I think it makes sense to
> add an is_access_allowed() check and treat the fault as spurious.  But I
> would like to try to bottom out on what exactly is happening, because I'm
> mildly concerned something is buggy in the primary MMU.
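
The idea of treating the fault as spurious when the existing SPTE already
allows the access can be illustrated with a small userspace model. This is
not the KVM code and not a proposed patch: the types, the model_* names, and
the simplified "present" test (any of the EPT read/write/execute bits set) are
all assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* EPT permission bits (hardware-defined). */
#define EPT_READ   (1ULL << 0)
#define EPT_WRITE  (1ULL << 1)
#define EPT_EXEC   (1ULL << 2)

/* Hypothetical model types; these do not exist in KVM. */
enum model_access { MODEL_READ, MODEL_WRITE, MODEL_EXEC };
enum model_result { MODEL_PF_INSTALL, MODEL_PF_SPURIOUS };

/* Simplified presence test: any permission bit set. */
bool model_spte_present(uint64_t spte)
{
	return (spte & (EPT_READ | EPT_WRITE | EPT_EXEC)) != 0;
}

/* Does the existing SPTE already permit the faulting access? */
bool model_access_allowed(enum model_access access, uint64_t spte)
{
	switch (access) {
	case MODEL_WRITE:
		return (spte & EPT_WRITE) != 0;
	case MODEL_EXEC:
		return (spte & EPT_EXEC) != 0;
	default:
		return (spte & EPT_READ) != 0;
	}
}

/*
 * Decide whether to install new_spte.  If another vCPU already installed
 * an SPTE that satisfies this fault, keep it instead of overwriting it.
 */
enum model_result model_handle_fault(enum model_access access,
				     uint64_t old_spte, uint64_t new_spte)
{
	if (new_spte == old_spte)
		return MODEL_PF_SPURIOUS;
	if (model_spte_present(old_spte) &&
	    model_access_allowed(access, old_spte))
		return MODEL_PF_SPURIOUS;
	return MODEL_PF_INSTALL;
}
```

In this model, a READ fault that finds the writable SPTE from the register
dump (0x860000025e000bf7) already installed is reported as spurious, so the
read-only value 0x800000025e0009f5 is never written over it.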

If you need more info from me, please feel free to let me know. And if you
send a patch to fix this problem, I can help verify it, since I think I have
found a stable reproducer.
