On Tue, Sep 04, 2018 at 12:07:37PM +0100, Punit Agrawal wrote:
> Christoffer Dall <christoffer.dall@xxxxxxx> writes:
>
> > On Mon, Sep 03, 2018 at 06:29:30PM +0100, Punit Agrawal wrote:
> >> Christoffer Dall <christoffer.dall@xxxxxxx> writes:
> >>
> >> > [Adding Andrea and Steve in CC]
> >> >
> >> > On Thu, Aug 23, 2018 at 04:33:42PM +0100, Marc Zyngier wrote:
> >> >> When triggering a CoW, we unmap the RO page via an MMU notifier
> >> >> (invalidate_range_start), and then populate the new PTE using another
> >> >> one (change_pte). In the meantime, we'll have copied the old page
> >> >> into the new one.
> >> >>
> >> >> The problem is that the data for the new page is sitting in the
> >> >> cache, and should the guest have an uncached mapping to that page
> >> >> (or its MMU off), following accesses will bypass the cache.
> >> >>
> >> >> In a way, this is similar to what happens on a translation fault:
> >> >> We need to clean the page to the PoC before mapping it. So let's just
> >> >> do that.
> >> >>
> >> >> This fixes a KVM unit test regression observed on a HiSilicon platform,
> >> >> and subsequently reproduced on Seattle.
> >> >>
> >> >> Fixes: a9c0e12ebee5 ("KVM: arm/arm64: Only clean the dcache on translation fault")
> >> >> Reported-by: Mike Galbraith <efault@xxxxxx>
> >> >> Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
> >> >> ---
> >> >>  virt/kvm/arm/mmu.c | 9 ++++++++-
> >> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >> >> index 1d90d79706bd..287c8e274655 100644
> >> >> --- a/virt/kvm/arm/mmu.c
> >> >> +++ b/virt/kvm/arm/mmu.c
> >> >> @@ -1811,13 +1811,20 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data
> >> >>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> >> >>  {
> >> >>  	unsigned long end = hva + PAGE_SIZE;
> >> >> +	kvm_pfn_t pfn = pte_pfn(pte);
> >> >>  	pte_t stage2_pte;
> >> >>
> >> >>  	if (!kvm->arch.pgd)
> >> >>  		return;
> >> >>
> >> >>  	trace_kvm_set_spte_hva(hva);
> >> >> -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> >> >> +
> >> >> +	/*
> >> >> +	 * We've moved a page around, probably through CoW, so let's treat it
> >> >> +	 * just like a translation fault and clean the cache to the PoC.
> >> >> +	 */
> >> >> +	clean_dcache_guest_page(pfn, PAGE_SIZE);
> >> >> +	stage2_pte = pfn_pte(pfn, PAGE_S2);
> >> >>  	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> >> >>  }
> >> >
> >> > How does this work for pmd mappings?
> >>
> >> kvm_set_spte_hva() isn't called for PMD mappings. But...
> >>
> >> >
> >> > Are we guaranteed that a pmd mapping (hugetlbfs or THP) is split before
> >> > a CoW happens?
> >> >
> >> > Steve tells me that we share THP mappings on fork and that we back THPs
> >> > by a zero page, so CoW with THP should be possible.
> >> >
> >>
> >> ...the problem seems to affect handling write permission faults for CoW
> >> or zero pages.
> >>
> >> The memory gets unmapped at stage 2 due to the invalidate notifier (in
> >> hugetlb_cow() for hugetlbfs and do_huge_pmd_wp_page() for THP)
> >
> > So just to make sure I get this right. For a pte CoW, Linux calls the
> > set_spte function to simply change the pte mapping, without doing any
> > unmapping at stage 2,
>
> No.
>
> I hadn't checked into the PTE CoW for zero pages when replying but was
> relying on Marc's commit log. I've outlined the flow below.
>
> > but with pmd, we unmap and wait to take another fault as an
> > alternative method?
>
> Having looked at the handling of CoW for the different page sizes,
> here's my understanding of the steps for CoW faults - note the slight
> variance when dealing with PTE entries.
>
> * Guest takes a stage 2 permission fault (user_mem_abort())
>
> * The host mapping is updated to point to another page (either zeroed
>   or contents copied). This happens via the get_user_pages_unlocked()
>   invoked via gfn_to_pfn_prot().
>
> * For all page sizes, mmu_notifier_invalidate_range_start() notifiers
>   are called, which will unmap the memory at stage 2.
>
> * For PTE (wp_page_copy), set_pte_at_notify() is called, which
>   eventually calls the kvm_set_spte_hva() modified in $SUBJECT.
>
>   For hugepages (hugetlb_cow) and anonymous THP (do_huge_pmd_wp_page)
>   there are no notifying versions of the page table entry updaters, so
>   stage 2 entries remain unmapped.
>
> * mmu_notifier_invalidate_range_end() is called. This updates
>   mmu_notifier_seq, which will abort the stage 2 fault handling in
>   user_mem_abort().
>
> * As stage 2 was left unmapped for hugepages (no change_pte notifier),
>   there'll be a subsequent translation fault where the mapped page will
>   be cleaned/invalidated.
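[For reference: the change_pte step in the flow above is the generic MMU
notifier callback in virt/kvm/kvm_main.c. The sketch below is abridged
from a v4.18-era tree and may differ in detail from other versions:]

static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long address,
					pte_t pte)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	spin_lock(&kvm->mmu_lock);
	/*
	 * Bump the sequence count first so that a stage 2 fault handler
	 * racing with this notifier sees the change and retries.
	 */
	kvm->mmu_notifier_seq++;
	/* On arm/arm64 this lands in the kvm_set_spte_hva() of $SUBJECT. */
	kvm_set_spte_hva(kvm, address, pte);
	spin_unlock(&kvm->mmu_lock);
	srcu_read_unlock(&kvm->srcu, idx);
}

[Note how the callback both bumps mmu_notifier_seq and installs the new
translation under mmu_lock; user_mem_abort() checks the same sequence
count to detect that it raced with the notifier.]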
> > Why the two different approaches depending on the mapping size?
>
> I'm not sure why a hugepage change notifier doesn't exist. Probably
> it's simply not implemented yet, or not considered important because
> the number of faults is reduced with larger mappings anyhow.
> >
> >> while the
> >> cache maintenance for the newly allocated page will be skipped due to
> >> the !FSC_PERM.
> >
> > If the above holds, then the subsequent fault would be similar to any
> > other fault which should work via the normal mechanism (dcache clean +
> > XN on fault, icache invalidation on permission fault).
> >>
> >> Hmm... smells like there might be a problem here. I'll try and put
> >> together a fix.
> >>
> >
> > It's not clear to me if we have a problem, or just different ways of
> > handling the PMD vs. PTE CoW case?
>
> With the above context, I don't think there is a problem for hugepage
> handling. The missing dcache maintenance addressed by $SUBJECT should be
> sufficient for PTEs.

Agreed. Thanks for the explanation!

Christoffer
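[For reference: the "normal mechanism" mentioned above is the cache
maintenance at the end of user_mem_abort() in virt/kvm/arm/mmu.c. A
simplified sketch from a v4.18-era tree; the hugepage branch has the
same structure with PMD_SIZE in place of PAGE_SIZE:]

	/*
	 * Clean the dcache only when establishing a new translation; on
	 * a permission fault the page was already mapped (and cleaned)
	 * at stage 2, hence the !FSC_PERM discussed above.
	 */
	if (fault_status != FSC_PERM)
		clean_dcache_guest_page(pfn, PAGE_SIZE);

	/*
	 * Mappings start out non-executable; the icache is invalidated
	 * only when the guest first faults on an instruction fetch.
	 */
	if (exec_fault)
		invalidate_icache_guest_page(pfn, PAGE_SIZE);

[This is why leaving a hugepage unmapped at stage 2 is safe: the
eventual translation fault takes the !FSC_PERM path and performs the
dcache clean that change_pte would otherwise have had to do.]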