On Sat, Dec 05, 2009 at 10:15:44PM +0200, Avi Kivity wrote:
> On 12/05/2009 09:42 PM, Marcelo Tosatti wrote:
>>
>>> I don't think the OS has "other mechanisms", though - the processor
>>> can speculate the tlb so that would be an OS bug.
>>
>> Can it? I figured it relied on the fact that no access (therefore no
>> TLB entry instantiation) meant there is no need to invlpg (since there
>> is nothing in the TLB to invalidate), before updating a particular pte.
>>
>> The documentation states that invlpg invalidates any entries for the
>> linear address.
>
> 4.10.1.3 says, "The processor may cache translations required for
> prefetches and for accesses that are a result of speculative execution
> that would never actually occur in the executed code path.", so there
> is no way for the OS to ensure no access has occurred. If you change a
> present pte, you must execute invlpg afterwards to ensure speculation
> hasn't instantiated the old pte.
>
>>> It looks like a race:
>>>
>>>> Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>>>>
>>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>>> index a601713..58a0f1e 100644
>>>> --- a/arch/x86/kvm/paging_tmpl.h
>>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>>> @@ -455,8 +455,6 @@ out_unlock:
>>>>  static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>>>  {
>>>>          struct kvm_shadow_walk_iterator iterator;
>>>> -        pt_element_t gpte;
>>>> -        gpa_t pte_gpa = -1;
>>>>          int level;
>>>>          u64 *sptep;
>>>>          int need_flush = 0;
>>>> @@ -470,10 +468,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>>>                  if (level == PT_PAGE_TABLE_LEVEL ||
>>>>                      ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
>>>>                      ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
>>>> -                        struct kvm_mmu_page *sp = page_header(__pa(sptep));
>>>> -
>>>> -                        pte_gpa = (sp->gfn << PAGE_SHIFT);
>>>> -                        pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
>>>>
>>>>                          if (is_shadow_present_pte(*sptep)) {
>>>>                                  rmap_remove(vcpu->kvm, sptep);
>>>> @@ -492,18 +486,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>>>          if (need_flush)
>>>>                  kvm_flush_remote_tlbs(vcpu->kvm);
>>>>          spin_unlock(&vcpu->kvm->mmu_lock);
>>>> -
>>>> -        if (pte_gpa == -1)
>>>> -                return;
>>>> -        if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
>>>> -                                  sizeof(pt_element_t)))
>>>> -                return;
>>>
>>> Here, another vcpu updates the gpte and issues a new invlpg.
>>>
>>>> -        if (is_present_gpte(gpte) && (gpte & PT_ACCESSED_MASK)) {
>>>> -                if (mmu_topup_memory_caches(vcpu))
>>>> -                        return;
>>>> -                kvm_mmu_pte_write(vcpu, pte_gpa, (const u8 *)&gpte,
>>>> -                                  sizeof(pt_element_t), 0);
>>>> -        }
>>>
>>> And here we undo the correct invlpg with the outdated gpte.
>>>
>>> Looks like we considered this, since kvm_read_guest_atomic() is only
>>> needed if inside the spinlock, but some other change moved the
>>> spin_unlock() upwards. Will investigate history.
>>
>> Isn't it the OS's responsibility to serialize pte updates + invlpg
>> between CPUs?
>
> It is. Do you still have a trace of the error? Maybe we can understand
> what the guest thought it was doing.

The bugcheck data:

BAD_POOL_HEADER (19)
The pool is already corrupt at the time of the current request.
This may or may not be due to the caller.
The internal pool links must be walked to figure out a possible cause
of the problem, and then special pool applied to the suspect tags or
the driver verifier to a suspect driver.
Arguments:
Arg1: 00000021, the data following the pool block being freed is corrupt.
      Typically this means the consumer (call stack) has overrun the block.
Arg2: 95424000, The pool pointer being freed.
Arg3: 00001010, The number of bytes allocated for the pool block.
Arg4: 00000000, The corrupted value found following the pool block.

The BAD_POOL_HEADER BSOD happens at address 0xFFFFF8A000DDD000
(complaining it contains "000000", Arg4). Walking the pagetables leads
to 0x18996 as the pte page:

(qemu) xp 0x18996ee8          (vaddr 0xFFFFF8A000DDD000)
0000000018996ee8: 0x153c9963
(qemu) xp 0x18996ef0          (vaddr 0xFFFFF8A000DDE000)
0000000018996ef0: 0x1528a963
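The pte offsets line up with the faulting virtual addresses. A quick
standalone cross-check (a minimal sketch assuming standard x86-64 4KiB
paging, where the last-level pte index is bits 12-20 of the virtual
address and each pte is 8 bytes; pte_offset() is just an illustrative
helper, not kernel code):

#include <stdio.h>
#include <stdint.h>

/* byte offset of the last-level pte within its pte page */
static unsigned int pte_offset(uint64_t vaddr)
{
        return ((vaddr >> 12) & 0x1ff) * sizeof(uint64_t);
}

int main(void)
{
        printf("%x\n", pte_offset(0xFFFFF8A000DDD000ULL)); /* prints ee8 */
        printf("%x\n", pte_offset(0xFFFFF8A000DDE000ULL)); /* prints ef0 */
        return 0;
}

So offset 0xee8 is the pte for 0xFFFFF8A000DDD000 and offset 0xef0 the
pte for 0xFFFFF8A000DDE000, matching the xp addresses above.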
The kvm_mmu_pte_write trace for that pte page:

qemu-system-x86-13667 [007] 425860.260987: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 15f11 invlpg=1
qemu-system-x86-13670 [004] 425860.264977: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 15253 invlpg=1
qemu-system-x86-13670 [004] 425860.265039: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 15f15 invlpg=1
qemu-system-x86-13670 [004] 425860.266591: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 146f3 invlpg=1
qemu-system-x86-13670 [004] 425860.268128: kvm_mmu_pte_write: sp->gfn 18996 (offset=ee8) gfn 14688 invlpg=1
qemu-system-x86-13670 [004] 425860.268592: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 159c7 invlpg=1
qemu-system-x86-13669 [003] 425861.267453: kvm_mmu_zap_page: sp gfn 18996 1/4 q0 w-x pge nxe root 0 unsync

The page is not shadowed again after this.

gfn 0x159c7 above contains a valid "end of pool block" header:

00000000  00 00 00 00 00 00 00 00  10 10 00 00 00 00 00 00  |................|
00000010  00 01 01 03 46 72 61 67  1c 00 33 d2 48 8d 44 24  |....Frag..3.H.D$|
00000020  01 01 fe 00 46 72 65 65  48 8d 44 04 30 48 f7 f1  |....FreeH.D.0H..|

But not the one which is in the pagetable at the time of the BSOD:

(qemu) xp /20x 0x1528a000
000000001528a000: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a010: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a020: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a030: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a040: 0x00000000 0x00000000 0x00000000 0x00000000

So my theory was: Windows wrote gfn 0x1528a to offset 0xef0 but skipped
the invlpg, so a write to address 0xFFFFF8A000DDE000 ended up in the
wrong gfn (the one prefaulted by the invlpg handler).
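For reference, the update sequence the guest is expected to perform when
changing a present pte looks roughly like this (a minimal sketch, not
Windows's actual code; set_pte_and_flush() is an illustrative name):

#include <stdint.h>

/* ring-0 only: invlpg is a privileged instruction */
static void set_pte_and_flush(volatile uint64_t *pte, uint64_t new_val,
                              void *vaddr)
{
        *pte = new_val;                 /* publish the new translation */
        asm volatile("invlpg (%0)"      /* then invalidate any cached or
                                           speculatively instantiated TLB
                                           entry for this linear address */
                     : : "r" (vaddr) : "memory");
        /* plus whatever TLB-shootdown IPIs the OS uses to keep the
           other CPUs in sync */
}

The theory above is that for the final write of gfn 0x1528a to offset
0xef0, the guest performed the pte store but not the invlpg step.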