On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
> On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> > > When S1 and S2 memory attributes combine with respect to caching
> > > policy, non-cacheable types take precedence. If a guest maps a
> > > region as device memory, which KVM userspace is using to emulate
> > > the device with normal, cacheable memory, then we lose coherency.
> > > With KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which
> > > memory regions are likely to be problematic. With this patch, as
> > > pages of these types of regions are faulted into the guest, not
> > > only do we flush the page's dcache, but we also change userspace's
> > > mapping to NC in order to maintain coherency.
> > >
> > > What if the guest doesn't do what we expect? While we can't
> > > force a guest to use cacheable memory, we can take advantage of
> > > the non-cacheable precedence, and force it to use non-cacheable.
> > > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> > > KVM_MEM_UNCACHED regions to force them to NC.
> > >
> > > We now have both guest and userspace on the same page (pun intended)
> >
> > I'd like to revisit the overall approach here. Is doing non-cached
> > accesses in both the guest and host really the right thing to do?
>
> I think so, but all ideas/approaches are still on the table. This is
> still an RFC.
>
> > The semantics of the device become that it is cache coherent (because
> > QEMU is), and I think Marc argued that Linux/UEFI should simply be
> > adapted to handle whatever emulated devices we have as coherent. I
> > also remember someone arguing that would be wrong (Peter?).
>
> I'm not really for quirking all devices in all guest types (AAVMF,
> Linux, other bootloaders, other OSes). Windows is unlikely to apply
> any quirks.
>
Well, my point was that if we're emulating a platform with coherent IO
memory for PCI devices, then that is something the guest should work
with as such; but, as Paolo explained, it should always be safe for a
guest to assume non-coherent, so that doesn't work.

> > Finally, does this address all cache coherency issues with emulated
> > devices? Some VOS guys had seen things still not working with this
> > approach, unsure why... I'd like to avoid us merging this only to
> > merge a more complete solution in a few weeks which reverts this one...
>
> I'm not sure (this is still an RFT too :-) We definitely would need to
> scatter some more memory_region_set_uncached() calls around QEMU first.
>
It would be good if you could sync with the VOS guys and make sure your
patch set addresses their issues with the appropriate
memory_region_set_uncached() calls added to QEMU, and, if it does not,
have some idea of why that falls outside the scope of this patch set.
After all, adding a USB controller to a VM is not that esoteric a use
case, is it?
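For what it's worth, the QEMU side would then be roughly a one-liner per
device model. A sketch only, assuming the memory_region_set_uncached()
interface from the QEMU counterpart of this RFC; the "toy-fb" device and
all of its names are invented for illustration:

#include "hw/sysbus.h"

#define TYPE_TOY_FB "toy-fb"
#define TOY_FB(obj) OBJECT_CHECK(ToyFBState, (obj), TYPE_TOY_FB)

typedef struct ToyFBState {
    SysBusDevice parent_obj;
    MemoryRegion vram;
    uint64_t vram_size;
} ToyFBState;

static void toy_fb_realize(DeviceState *dev, Error **errp)
{
    ToyFBState *s = TOY_FB(dev);

    /* The VRAM is ordinary, cacheable RAM from QEMU's point of view. */
    memory_region_init_ram(&s->vram, OBJECT(dev), "toy-fb.vram",
                           s->vram_size, errp);

    /*
     * The guest will most likely map a framebuffer with Device
     * attributes, so hint to KVM that the memslot backing this region
     * needs the KVM_MEM_UNCACHED treatment.  (Interface assumed from
     * the RFC's QEMU patches, not from mainline QEMU.)
     */
    memory_region_set_uncached(&s->vram);
}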
> >
> > More comments/questions below:
> >
> > >
> > > Signed-off-by: Andrew Jones <drjones@xxxxxxxxxx>
> > > ---
> > >  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
> > >  arch/arm/include/asm/pgtable-3level.h |  1 +
> > >  arch/arm/include/asm/pgtable.h        |  1 +
> > >  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
> > >  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
> > >  arch/arm64/include/asm/memory.h       |  1 +
> > >  arch/arm64/include/asm/pgtable.h      |  1 +
> > >  7 files changed, 36 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > > index 405aa18833073..e8034a80b12e5 100644
> > > --- a/arch/arm/include/asm/kvm_mmu.h
> > > +++ b/arch/arm/include/asm/kvm_mmu.h
> > > @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> > >  	while (size) {
> > >  		void *va = kmap_atomic_pfn(pfn);
> > >
> > > -		if (need_flush)
> > > +		if (need_flush) {
> > >  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > > +			if (ipa_uncached)
> > > +				set_memory_nc((unsigned long)va, 1);
> >
> > nit: consider moving this outside the need_flush
> >
> > > +		}
> > >
> > >  		if (icache_is_pipt())
> > >  			__cpuc_coherent_user_range((unsigned long)va,
> > > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > > index a745a2a53853c..39b3f7a40e663 100644
> > > --- a/arch/arm/include/asm/pgtable-3level.h
> > > +++ b/arch/arm/include/asm/pgtable-3level.h
> > > @@ -121,6 +121,7 @@
> > >   * 2nd stage PTE definitions for LPAE.
> > >   */
> > >  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> > > +#define L_PTE_S2_MT_NORMAL_NC	(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
> > >  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
> > >  #define L_PTE_S2_MT_WRITEBACK	(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
> > >  #define L_PTE_S2_MT_DEV_SHARED	(_AT(pteval_t, 0x1) << 2) /* device */
> > > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > > index f40354198bad4..ae13ca8b0a23d 100644
> > > --- a/arch/arm/include/asm/pgtable.h
> > > +++ b/arch/arm/include/asm/pgtable.h
> > > @@ -100,6 +100,7 @@ extern pgprot_t pgprot_s2_device;
> > >  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
> > >  #define PAGE_HYP_DEVICE	_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
> > >  #define PAGE_S2		_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> > > +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
> > >  #define PAGE_S2_DEVICE	_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
> > >
> > >  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> > > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > > index bc1665acd73e7..6b3bd8061bd2a 100644
> > > --- a/arch/arm/kvm/mmu.c
> > > +++ b/arch/arm/kvm/mmu.c
> > > @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	struct vm_area_struct *vma;
> > >  	pfn_t pfn;
> > >  	pgprot_t mem_type = PAGE_S2;
> > > -	bool fault_ipa_uncached;
> > > +	bool fault_ipa_uncached = false;
> > >  	bool logging_active = memslot_is_logging(memslot);
> > >  	unsigned long flags = 0;
> > >
> > > @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  		writable = false;
> > >  	}
> > >
> > > +	if (memslot->flags & KVM_MEM_UNCACHED) {
> > > +		mem_type = PAGE_S2_NORMAL_NC;
> > > +		fault_ipa_uncached = true;
> > > +	}
> > > +
> > >  	spin_lock(&kvm->mmu_lock);
> > >  	if (mmu_notifier_retry(kvm, mmu_seq))
> > >  		goto out_unlock;
> > > @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	if (!hugetlb && !force_pte)
> > >  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> > >
> > > -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> > > -
> > >  	if (hugetlb) {
> > >  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> > >  		new_pmd = pmd_mkhuge(new_pmd);
> > > @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> > >  			     unsigned long start,
> > >  			     unsigned long end,
> > >  			     int (*handler)(struct kvm *kvm,
> > > +					    struct kvm_memory_slot *slot,
> > >  					    gpa_t gpa, void *data),
> > >  			     void *data)
> > >  {
> > > @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> > >
> > >  		for (; gfn < gfn_end; ++gfn) {
> > >  			gpa_t gpa = gfn << PAGE_SHIFT;
> > > -			ret |= handler(kvm, gpa, data);
> > > +			ret |= handler(kvm, memslot, gpa, data);
> > >  		}
> > >  	}
> > >
> > >  	return ret;
> > >  }
> > >
> > > -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				 gpa_t gpa, void *data)
> >
> > Maybe we should consider a pointer to a struct with the relevant data
> > to pass around to the handler by now, which would allow us to get rid
> > of the void * cast as well. Not sure if it's worth it.
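Something like the following, perhaps. Just a sketch: the context struct
and its name are invented here, not part of the patch, and only the
set_spte case is shown:

struct hva_handler_ctx {
	struct kvm_memory_slot *slot;	/* filled in by handle_hva_to_gpa() */
	pte_t pte;			/* only used by the set_spte handler */
};

static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa,
				struct hva_handler_ctx *ctx)
{
	/* Pick the stage-2 attributes based on the memslot's flag. */
	pgprot_t prot = (ctx->slot->flags & KVM_MEM_UNCACHED)
				? PAGE_S2_NORMAL_NC : PAGE_S2;
	pte_t pte = pfn_pte(pte_pfn(ctx->pte), prot);

	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
	return 0;
}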
> >
> > >  {
> > >  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
> > >  	return 0;
> > > @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
> > >  	return 0;
> > >  }
> > >
> > > -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				gpa_t gpa, void *data)
> > >  {
> > > -	pte_t *pte = (pte_t *)data;
> > > +	pte_t pte = *((pte_t *)data);
> > > +
> > > +	if (slot->flags & KVM_MEM_UNCACHED)
> > > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> > > +	else
> > > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > >
> > >  	/*
> > >  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> > > @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
> > >  	 * through this calling path.
> > >  	 */
> > > -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> > > +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
> >
> > This is making me feel like we should have a separate patch that
> > changes stage2_set_pte from taking a pointer to just taking a value
> > for the new pte.
> >
> > >  	return 0;
> > >  }
> > >
> > > @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> > >  {
> > >  	unsigned long end = hva + PAGE_SIZE;
> > > -	pte_t stage2_pte;
> > >
> > >  	if (!kvm->arch.pgd)
> > >  		return;
> > >
> > >  	trace_kvm_set_spte_hva(hva);
> > > -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > > -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> > > +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
> >
> > Hooking in here will make sure you catch changes to the page used for
> > the mapping, but wouldn't that also mean that the userspace mapping
> > would have been changed, and where are you updating this?
> >
> > Also, is this called if the userspace mapping is zapped without doing
> > anything about the underlying page? (how do we then catch when the
> > userspace pte is populated again, and is this even possible?)
>
> I was hoping that I only needed to worry about getting the S2
> attributes right here, and then, since the page will need to be
> refaulted into the guest anyway, that the userspace part would get
> taken care of at that point (in user_mem_abort).
>
user_mem_abort handles stage-2 page table faults, which have (almost)
nothing to do with the userspace page tables. I think it's entirely
possible to have a stage-2 mapping to a page which is no longer mapped
into userspace at all. Or do we pin the userspace PTEs (not just keep a
reference on the physical page) during fault handling?

> But, to be honest, I forgot to dig into this deep enough to know if my
> hope will work or not.
>
You really need to work this out so you feel confident with the overall
scheme here; then I can try to see if I can break it under review. But I
think the author of this patch must know how it is *supposed* to work ;)

> > >  }
> > >
> > > -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +			       gpa_t gpa, void *data)
> > >  {
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > > @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  	return 0;
> > >  }
> > >
> > > -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				    gpa_t gpa, void *data)
> > >  {
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > index 61505676d0853..af5f0f0eccef9 100644
> > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> > >  {
> > >  	void *va = page_address(pfn_to_page(pfn));
> > >
> > > -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> > > +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
> > >  		kvm_flush_dcache_to_poc(va, size);
> > > +		if (ipa_uncached)
> > > +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
> >
> > Are you not setting the kernel mapping of the page to non-cached here,
> > which doesn't affect your userspace mappings at all?
>
> Oh crap. I shouldn't have tried to use change_memory_common... I
> completely overlooked the fact I'm now using the wrong mm...
> and the wrong va...
>
> > (this would explain why things still break with this series).
>
> Yeah, I wonder why it works so well?
>
Luck, slowing things down, or reordering operations to make the breakage
less likely would be my guess. (A sketch of the kernel-vs-userspace
mapping mix-up is appended below.)

Thanks,
-Christoffer
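P.S.: to illustrate the set_memory_nc() problem discussed above, here is
roughly what the patch ends up doing. A sketch only, not a fix; the
helper name is invented, and set_memory_nc() itself is introduced by
this RFC series, not mainline:

/*
 * page_address() returns the page's *kernel* linear-map VA, so
 * set_memory_nc() on it rewrites the attributes of the kernel mapping
 * (init_mm).  QEMU reaches the same page through a VMA in its own mm,
 * whose PTEs keep their cacheable attributes -- so the userspace side
 * of the coherency problem is left untouched.
 */
static void mark_nc_kernel_only(pfn_t pfn)
{
	void *va = page_address(pfn_to_page(pfn));

	set_memory_nc((unsigned long)va, 1);	/* kernel mapping only */
}

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm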