On Tue, Sep 04, 2018 at 12:07:37PM +0100, Punit Agrawal wrote:
> Christoffer Dall <christoffer.dall@xxxxxxx> writes:
>
> > On Mon, Sep 03, 2018 at 06:29:30PM +0100, Punit Agrawal wrote:
> >> Christoffer Dall <christoffer.dall@xxxxxxx> writes:
> >>
> >> > [Adding Andrea and Steve in CC]
> >> >
> >> > On Thu, Aug 23, 2018 at 04:33:42PM +0100, Marc Zyngier wrote:
> >> >> When triggering a CoW, we unmap the RO page via an MMU notifier
> >> >> (invalidate_range_start), and then populate the new PTE using another
> >> >> one (change_pte). In the meantime, we'll have copied the old page
> >> >> into the new one.
> >> >>
> >> >> The problem is that the data for the new page is sitting in the
> >> >> cache, and should the guest have an uncached mapping to that page
> >> >> (or its MMU off), following accesses will bypass the cache.
> >> >>
> >> >> In a way, this is similar to what happens on a translation fault:
> >> >> We need to clean the page to the PoC before mapping it. So let's just
> >> >> do that.
> >> >>
> >> >> This fixes a KVM unit test regression observed on a HiSilicon platform,
> >> >> and subsequently reproduced on Seattle.
> >> >>
> >> >> Fixes: a9c0e12ebee5 ("KVM: arm/arm64: Only clean the dcache on translation fault")
> >> >> Reported-by: Mike Galbraith <efault@xxxxxx>
> >> >> Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
> >> >> ---
> >> >>  virt/kvm/arm/mmu.c | 9 ++++++++-
> >> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >> >> index 1d90d79706bd..287c8e274655 100644
> >> >> --- a/virt/kvm/arm/mmu.c
> >> >> +++ b/virt/kvm/arm/mmu.c
> >> >> @@ -1811,13 +1811,20 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data
> >> >>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> >> >>  {
> >> >>  	unsigned long end = hva + PAGE_SIZE;
> >> >> +	kvm_pfn_t pfn = pte_pfn(pte);
> >> >>  	pte_t stage2_pte;
> >> >>
> >> >>  	if (!kvm->arch.pgd)
> >> >>  		return;
> >> >>
> >> >>  	trace_kvm_set_spte_hva(hva);
> >> >> -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> >> >> +
> >> >> +	/*
> >> >> +	 * We've moved a page around, probably through CoW, so let's treat it
> >> >> +	 * just like a translation fault and clean the cache to the PoC.
> >> >> +	 */
> >> >> +	clean_dcache_guest_page(pfn, PAGE_SIZE);
> >> >> +	stage2_pte = pfn_pte(pfn, PAGE_S2);
> >> >>  	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> >> >>  }
> >> >
> >> > How does this work for pmd mappings?
> >>
> >> kvm_set_spte_hva() isn't called for PMD mappings. But...
> >>
> >> >
> >> > Are we guaranteed that a pmd mapping (hugetlbfs or THP) is split before
> >> > a CoW happens?
> >> >
> >> > Steve tells me that we share THP mappings on fork and that we back THPs
> >> > by a zero page, so CoW with THP should be possible.
> >> >
> >>
> >> ...the problem seems to affect handling write permission faults for CoW
> >> or zero pages.
> >>
> >> The memory gets unmapped at stage 2 due to the invalidate notifier (in
> >> hugetlb_cow() for hugetlbfs and do_huge_pmd_wp_page() for THP)
> >
> > So just to make sure I get this right. For a pte CoW, Linux calls the
> > set_spte function to simply change the pte mapping, without doing any
> > unmapping at stage 2,
>
> No.
>
> I hadn't checked into the PTE CoW for zero pages when replying but was
> relying on Marc's commit log. I've outlined the flow below.
>
> > but with pmd, we unmap and wait to take another fault as an
> > alternative method?
>
> Having looked at the handling of CoW for the different page sizes,
> here's my understanding of the steps for CoW faults - note the slight
> variance when dealing with PTE entries.
>
> * Guest takes a stage 2 permission fault (user_mem_abort())
>
> * The host mapping is updated to point to another page (either zeroed
>   or contents copied). This happens via the get_user_pages_unlocked()
>   invoked via gfn_to_pfn_prot().
>
> * For all page sizes, mmu_notifier_invalidate_range_start() notifiers
>   are called, which will unmap the memory at stage 2.
>
> * For PTE (wp_page_copy), set_pte_at_notify() is called, which
>   eventually calls the kvm_set_spte_hva() modified in $SUBJECT.
>
>   For hugepages (hugetlb_cow) and anonymous THP (do_huge_pmd_wp_page)
>   there are no notifying versions of the page table entry updaters, so
>   stage 2 entries remain unmapped.
>
> * mmu_notifier_invalidate_range_end() is called. This updates
>   mmu_notifier_seq, which will abort the stage 2 fault handling in
>   user_mem_abort().
>
> * As stage 2 was left unmapped for hugepages (no change_pte notifier),
>   there'll be a subsequent translation fault where the mapped page will
>   be cleaned/invalidated.
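[For reference: the change_pte step in the flow above is the generic MMU
notifier callback in virt/kvm/kvm_main.c. The sketch below is abridged
from a v4.18-era tree and may differ in detail from other versions:]

static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long address,
					pte_t pte)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	spin_lock(&kvm->mmu_lock);
	/*
	 * Bump the sequence count first so that a stage 2 fault handler
	 * racing with this notifier sees the change and retries.
	 */
	kvm->mmu_notifier_seq++;
	/* On arm/arm64 this lands in the kvm_set_spte_hva() of $SUBJECT. */
	kvm_set_spte_hva(kvm, address, pte);
	spin_unlock(&kvm->mmu_lock);
	srcu_read_unlock(&kvm->srcu, idx);
}

[Note how the callback both bumps mmu_notifier_seq and installs the new
translation under mmu_lock; user_mem_abort() checks the same sequence
count to detect that it raced with the notifier.]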
> > Why the two different approaches depending on the mapping size?
>
> I'm not sure why a hugepage change notifier doesn't exist. Probably
> it's simply not implemented yet, or not considered important because
> the number of faults is reduced with larger mappings anyhow.
> >
> >> while the
> >> cache maintenance for the newly allocated page will be skipped due to
> >> the !FSC_PERM.
> >
> > If the above holds, then the subsequent fault would be similar to any
> > other fault which should work via the normal mechanism (dcache clean +
> > XN on fault, icache invalidation on permission fault).
> >>
> >> Hmm... smells like there might be a problem here. I'll try and put
> >> together a fix.
> >>
> >
> > It's not clear to me if we have a problem, or just different ways of
> > handling the PMD vs. PTE CoW case?
>
> With the above context, I don't think there is a problem for hugepage
> handling. The missing dcache maintenance addressed by $SUBJECT should be
> sufficient for PTEs.

Agreed. Thanks for the explanation!

Christoffer
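[For reference: the "normal mechanism" mentioned above is the cache
maintenance at the end of user_mem_abort() in virt/kvm/arm/mmu.c. A
simplified sketch from a v4.18-era tree; the hugepage branch has the
same structure with PMD_SIZE in place of PAGE_SIZE:]

	/*
	 * Clean the dcache only when establishing a new translation; on
	 * a permission fault the page was already mapped (and cleaned)
	 * at stage 2, hence the !FSC_PERM discussed above.
	 */
	if (fault_status != FSC_PERM)
		clean_dcache_guest_page(pfn, PAGE_SIZE);

	/*
	 * Mappings start out non-executable; the icache is invalidated
	 * only when the guest first faults on an instruction fetch.
	 */
	if (exec_fault)
		invalidate_icache_guest_page(pfn, PAGE_SIZE);

[This is why leaving a hugepage unmapped at stage 2 is safe: the
eventual translation fault takes the !FSC_PERM path and performs the
dcache clean that change_pte would otherwise have had to do.]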