[Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]

bugzilla-daemon@xxxxxxxxxx · Wed, 11 Dec 2024 16:12:14 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #1 from Sean Christopherson (seanjc@xxxxxxxxxx) ---
On Wed, Dec 11, 2024, bugzilla-daemon@xxxxxxxxxx wrote:
> I hit a bug on the intel host, this problem occurs randomly:
> [  406.127925] ------------[ cut here ]------------
> [  406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001
> tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]

Can you describe the host activity at the time of the WARN?  E.g. is it under
memory pressure and potentially swapping, is KSM or NUMA balancing active? I
have a sound theory for how the scenario occurs on KVM's end, but I still think
it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
situation.

And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
memory exposed to it?  E.g. an assigned device?  If so, can you provide the
register
state from the other WARNs?  If the PFNs are all in the same range, then maybe
this is something funky with the VM_PFNMAP | VM_IO path.

The WARN is a sanity check I added because it should be impossible for KVM to
install a non-writable SPTE overtop an existing writable SPTE.  Or so I
thought.
The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
correctly handles the unexpected change, the WARN is there purely to flag that
something unexpected happen.

        if (new_spte == iter->old_spte)
                ret = RET_PF_SPURIOUS;
        else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
                return RET_PF_RETRY;
        else if (is_shadow_present_pte(iter->old_spte) &&
                 (!is_last_spte(iter->old_spte, iter->level) ||
                  WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte,
new_spte)))) <====
                kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);

Cross referencing the register state

  RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
  RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
  RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
  R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
  R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000

with the disassembly

  4885C8                          TEST RAX,RCX
  0F84EEFEFFFF                    JE 0000000000000-F1
  4985C8                          TEST R8,RCX
  0F85E5FEFFFF                    JNE 0000000000000-F1
  0F0B                            UD2

RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is:

  860000025e000bf7
  800000025e0009f5

On Intel, bits 57 and 58 are the host-writable and MMU-writable flags

  #define EPT_SPTE_HOST_WRITABLE        BIT_ULL(57)
  #define EPT_SPTE_MMU_WRITABLE         BIT_ULL(58)

which means KVM is overwriting a writable SPTE with a non-writable SPTE because
the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b)
retrieved
a non-writable PFN from the primary MMU, and that fault raced with a WRITE
fault
on a different vCPU that retrieved and installed a writable PFN.

On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a writable
PFN.
Given that KVM has an valid writable SPTE, the corresponding PTE in the primary
MMU
*must* be writable, otherwise there's a missing mmu_notifier invalidation.

        /* map read fault as writable if possible */
        if (!(flags & FOLL_WRITE) && kfp->map_writable &&
            get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
                put_page(page);
                page = wpage;
                flags |= FOLL_WRITE;
        }

out:
        *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
        return npages;

Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast
GUP,
but they should be mutually exclusive with KVM having a writable SPTE.  If the
mapping is truncated or the folio is swapped out, secondary MMUs need to be
invalidated before folio->mapping is nullified.

        /*
         * The mapping may have been truncated, in any case we cannot determine
         * if this mapping is safe - fall back to slow path to determine how to
         * proceed.
         */
        if (!mapping)
                return false;

And secretmem can't be GUP'd, and it's not a long-term pin, so these checks
don't
apply either:

        if (check_secretmem && secretmem_mapping(mapping))
                return false;
        /* The only remaining allowed file system is shmem. */
        return !reject_file_backed || shmem_mapping(mapping);

Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is
writable
in the primary MMU, regardless of the fault type.

If this turns out to get a legitimate scenario, then I think it makes sense to
add an is_access_allowed() check and treat the fault as spurious.  But I would
like to try to bottom out on what exactly is happening, because I'm mildly
concerned something is buggy in the primary MMU.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.