Re: kvm splat in mmu_spte_clear_track_bits

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Tue, 29 Aug 2017 22:49:01 +0200

Hello Linus,

On Tue, Aug 29, 2017 at 12:38:43PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
> >
> > Yes and i am fine with page traversal being under spinlock and not
> > being able to sleep during that. I agree doing otherwise would be
> > insane. It is just that the existing behavior of try_to_unmap_one()
> > and page_mkclean_one() have been broken and that no mmu_notifier
> > calls were added around the lock section.
> 
> Yeah, I'm actually surprised that ever worked. I'm surprised that
> try_to_unmap_one didn't hold any locks earlier.
> 
> In fact, I think at least some of them *did* already hold the page
> table locks: ptep_clear_flush_young_notify() and friends very much
> should have always held them.
> 
> So it's literally just that mmu_notifier_invalidate_page() call that
> used to be outside all the locks, but honestly, I think that was
> always a bug. It means that you got notified of the page removal
> *after* the page was already gone and all locks had been released, so
> a completely *different* page could already have been mapped to that
> address.
> 
> So I think the old code was always broken exactly because the callback
> wasn't serialized with the actual action.

That's a very interesting point indeed.

I thought the old code was safe because even if the secondary MMU view
keeps crunching on the old page for a little while after the primary
MMU got a new page mapped, you cannot measure it.

If you cannot measure it, it doesn't exist? I.e. undefined.

The same runtime you'd get by letting the guest mode crunch on the old
page for a little while after the primary MMU is already computing on
a brand new page, could materialize if the guest mode CPU just got a
turbo boost and computed faster before the ->invalidate_page was run
inside the PT lock.

If the page is swapped-in and is the same page that was unmapped by
try_to_unmap_one, then the primary MMU will return using the same page
that the secondary MMU was still using and it'll even be coherent
(swapin is probably prevented by the page lock, but still even if it
happens it doesn't look an issue).

If the page is suddenly replaced while vcpu is in guest mode, no
coherency could be provided anyway if the guest mode wasn't serialize
by something other than the PT lock.

I'll keep thinking about it, this is a quick answer.

On a side note, the other major requirement for the code not to have
been always broken before is that the caller must hold a refcount: the
put_page executed by try_to_unmap_one before calling
mmu_notifier_invalidate_page cannot free the page or it's unsafe. This
requirement also goes away if using range_start/end.

> > I sent a patch that properly compute the range to invalidate and move
> > to invalidate_range() but is lacking the invalidate_range_start()/
> > end() so i am gonna respin that with range_start/end bracketing and
> > assume the worse for the range of address.
> 
> So surrounding it with start/end _should_ make KVM happy.
> 
> KVM people, can you confirm?

Yes that always works as far as I can tell.

For our current problem using start/end like you suggested would
definitely make KVM happy and it is obviously safe.

It would also work to teach page_vma_mapped_walk(&pvmw) to restart
after dropping the lock with page_vma_mapped_walk_done() to call
mmu_notifier_invalidate_page outside the lock. If such an alternative
would be safe however, entirely depends if it was always broken before
which again is a very good point to think about.

The common case for such function is a single pte being unmapped.

> 
> But I do note that there's a number of other users of that
> "invalidate_page" callback.
> 
> I think ib_umem_notifier_invalidate_page() the exact same blocking
> issue, but changing to range_start/end should be good there too.
> 
> amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy
> being replaced with start/end.
> 
> In fact, I'm wondering if this actually means that we could get rid of
> mmu_notifier_invalidate_page() entirely. There's only a couple of
> callers, and the other one seems to be fs/dax.c, and it actually seems
> to have the exact same issue that the try_to_unmap_one() code had: it
> tried to invalidate an address too late - by the time it was called,
> the page gad already been cleaned and locks had been released.
> 
> So the more I look at that "turn mmu_notifier_invalidate_page() into
> invalidate_range_start/end()" the more I think that's fundamentally
> the right thing to do.

mmu_notifier_invalidate_page exists purely as an optimized version of
mmu_notifier_invalidate_range_start/end. Furthermore when it was
always run inside the PT lock it was quite handy to replace a
ptep_clear_flush into a ptep_clear_flush_notify and be done with
it. It also requires a single branch when the MMU notifier is unarmed.

mmu_notifier_invalidate_range_start has to first block all secondary
MMU page faults, and invalidate the secondary MMU _before_ the pages
are freed. mmu_notifier_invalidate_range_end unblocks the secondary
MMU page faults after the primary MMU has been invalidated too
(i.e. after zapping the ptes).

mmu_notifier_invalidate_page has the advantage that it takes the
secondary MMU KVM srcu and spinlock a single time.

Thanks!
Andrea