Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())

John Hubbard <jhubbard@xxxxxxxxxx> · Thu, 29 Sep 2022 19:31:03 -0700

On 9/28/22 11:12, Jason Gunthorpe wrote:
On Tue, Sep 27, 2022 at 12:24:41AM +0000, Sean Christopherson wrote:
On Mon, Sep 26, 2022, Jason Gunthorpe wrote:
On Mon, Sep 26, 2022 at 08:13:00PM +0000, Sean Christopherson wrote:

AFAIK if we are flushing the CPU tlb then we really must also flush
the CPU tlb that KVM controls, and that is primarily what
invalidate_range() is used for.

As above, for its actual secondary MMU, KVM invalidates and flushes at
invalidate_range_start(), and then prevents vCPUs from creating new entries for
the range until invalidate_range_start_end().

Was it always like this? Why did we add this invalidate_range thing if
nothing really needed it?

No, the invalidate_range() hook was added by commit 1897bdc4d331 ("mmu_notifier:
add mmu_notifier_invalidate_range()") for IOMMUs.

Ah, right OK. This is specifically because the iommu is sharing the
*exact* page table of the CPU so the trick KVM/etc uses where 'start'
makes the shadow PTE non-present and then delays the fault until end
completes cannot work here.

ohhh, is this trick something I should read more about, if I'm about to
jump in here?

     The page-fault handler in the AMD IOMMUv2 driver doesn't handle the fault
     if an invalidate_range_start/end pair is active, it just reports back
     SUCCESS to the device and let it refault the page.

Yah, this algorithm just doesn't really work, IMHO.. So it makes sense
we have invalidate_range as Joerg originally created it. Though the
GPU is still busted IMHO, there is no guarantee of forward progress
after some number of iterations, it is just much more likely if the
non-present is as narrow as possible.

So, then we can see where the end_only thing came from, commit
0f10851ea475 ("mm/mmu_notifier: avoid double notification when it is
useless") and that long winded message explains why some of the cases

I seem to recall that there was a performance drop involving GPUs, due
to the double notification. Just to fill in a little bit of history as
to why Jerome was trying to deduplicate the notifier callbacks.

must be ordered in the same place as the CPU flush, but doesn't
explain very much why it is OK to push it after beyond saying "ksm is
OK"

Looking at some of the places where 0f10851ea475 removed the notifies
they seem pretty pointless.

- fs/dax.c
   This never needed notify in the first place, it is populating a
   non-present PTE because it just faulted.

- __split_huge_zero_page_pmd()
   Sure, maybe, but who cares? The real fix here was changing
   __split_huge_pmd() to use only_end() because all paths already
   call invalidate_range

- copy_hugetlb_page_range()
   Sure, there is no CPU tlb flush.

   The CPU tlb flush on this path is in flush_tlb_mm() called by
   dup_mmap().

   The right thing to do is to ensure flush_tlb_mm() calls
   invalidate_range and skip it here. But the reasoning is not some
   "we are downgrading protections blah blah", the logic is that the
   CPU TLB flush can be delayed/consolidated so we can delay the
   shadow TLB flush too.

   (And why does copy_hugetlb_page_range use MMU_NOTIFY_CLEAR but
    copy_p4d_range is bounded by MMU_NOTIFY_PROTECTION_PAGE ??)

- hugetlb_change_protection()
   Again I feel like the sensible thing here is to trigger the shadow
   flush in flush_hugetlb_tlb_range() always and use end_only

.. and so on ..

So, IMHO, we need to rewrite what 0f10851ea475 was trying to do in
order to fix the bug Jann noticed :\  That is bigger than I can knock
off while I write this email though ..

After an initial pass through this, with perhaps 80% understanding
of the story, I'm reading that as:

    Audit all the sites (which you initially did quickly, above)
    that 0f10851ea475 touched, and any other related ones, and
    change things so that invalidate_range() and primary TLB
    flushing happen at the same point(s).

Yes? Anything else?

thanks,
--
John Hubbard
NVIDIA

That means iommu is really the only place using it as a proper
synchronous shadow TLB flush.

More or less. There's also an "OpenCAPI coherent accelerator
support" driver, drivers/misc/ocxl, that appears use
invalidate_range() the same way the IOMMU does.  No idea how
relevant that is these days.

Yeah, OpenCAPI is the same stuff as the IOMMU. Just PPC got away with
building all their IOMMU layer in its own arch specific subsystem :|

I much prefer KVM's (and the old IOMMU's) approach of re-faulting in hardware until
the entire sequence completes.   It _might_ be less performant, but I find it so
much easier to reason about.  I actually had typed out a "can we just kill off
mmu_notifier_invalidate_range() and force users to refault hardware" question
before seeing the above changelog.

The key thing this requires is the ability to put the hardware into
fault mode (non-present), for the range under invalidation. If you
can't do that, then you can't use it.

I don't know.  I found the series that introduced the behavior[*], but there are
no numbers provided and I haven't been able to dredge up why this was even looked
into in the first place.  From the cover letter:

It looks like a 'by inspection' project..

If I had a vote to cast, I would vote to always do invalidate_range() at the same
time the primary TLBs are flushed.  That seems completely logical and much harder
to screw up.  I might be a little biased though since KVM doesn't benefit from the
current shenanigans :-)

Me too.

Jason