Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 12 Sep 2014 12:10:36 -0700

On Wed, 10 Sep 2014 20:02:12 -0400 Jerome Glisse <j.glisse@xxxxxxxxx> wrote:

> On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@xxxxxxxxxx> wrote:
> > 
> > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > be implemented in IOMMUs or any other external device, e.g.
> > > ATS/PRI capable PCI devices.
> > > 
> > > The problem with managing these TLBs are the semantics of
> > > the invalidate_range_start/end call-backs currently
> > > available. Currently the subsystem using mmu_notifiers has
> > > to guarantee that no new TLB entries are established between
> > > invalidate_range_start/end. Furthermore the
> > > invalidate_range_start() function is called when all pages
> > > are still mapped and invalidate_range_end() when the pages
> > > are unmapped an already freed.
> > > 
> > > So both call-backs can't be used to safely flush any non-CPU
> > > TLB because _start() is called too early and _end() too
> > > late.
> > 
> > There's a lot of missing information here.  Why don't the existing
> > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > update the changelog to contain all this context.
> 
> So unlike KVM or any current user of the mmu_notifier api, any PCIE
> ATS/PASID capable hardware IOMMUv2 for AMD and forgot the marketing
> name for Intel (probably VTd) implementation walk the cpu page table
> on there own and have there own TLB cache. In fact not only the iommu
> can have a TLB cache but any single PCIE hardware can implement its
> own local TLB cache.
> 
> So if we flush the IOMMU and device TLB inside the range_start callback
> there is a chance that the hw will just rewalk the cpu page table and
> repopulate its TLB before the CPU page table is actually updated.
> 
> Now if we shoot down the TLB inside the range_end callback, then we
> are too late ie the CPU page table is already populated with new entry
> and all the TLB in the IOMMU an in device might be pointing to the old
> pages.
> 
> So the aim of this callback is to happen right after the CPU page table
> is updated but before the old page is freed or recycled.

You had me up to here.  What is "the old page"?  pagetable page, I
assume?  upper, middle, lower, all the above?  Why does the callback
function need to get at that page?

> > Also too skimpy.  I *think* this is a variant of the problem in the
> > preceding paragraph.  We get a fault storm (which is problem 2) and
> > sometimes the faulting device gives up (which is problem 1).
> > 
> > Or something.  Please de-fog all of this.
> > 
> 
> Does above explanation help understand the issue ? Given that on each
> device page fault an IRQ is trigger (well the way the hw works is bit
> more complex than that).

It helps.  Please understand that my main aim here is not to ask
specific questions.  I seek to demonstrate the level of detail which I
believe should be in the changelog so that others can effectively
understand and review this patchset, and to understand this code in the
future.

So it would be good if you were to update the changelog to answer these
questions.  It would be better if it were also to answer similar
questions which I didn't think to ask!

> > > Furthermore the _start()/end() notifiers only catch the
> > > moment when page mappings are released, but not page-table
> > > pages. But this is necessary for managing external TLBs when
> > > the page-table is shared with the CPU.
> > 
> > How come?
> 
> As explained above end() might happens after page that were previously
> mapped are free or recycled.

Again, some explanation of why the driver wishes to inspect the
pagetable pages would be useful.

> > 
> > > To solve this situation I wrote a patch-set to introduce a
> > > new notifier call-back: mmu_notifer_invalidate_range(). This
> > > notifier lifts the strict requirements that no new
> > > references are taken in the range between _start() and
> > > _end(). When the subsystem can't guarantee that any new
> > > references are taken is has to provide the
> > > invalidate_range() call-back to clear any new references in
> > > there.
> > > 
> > > It is called between invalidate_range_start() and _end()
> > > every time the VMM has to wipe out any references to a
> > > couple of pages. This are usually the places where the CPU
> > > TLBs are flushed too and where its important that this
> > > happens before invalidate_range_end() is called.
> > > 
> > > Any comments and review appreciated!
> > 
> > The patchset looks decent, although I find it had to review because I
> > just wasn't provided with enough of the thinking that went into it.  I
> > have enough info to look at the C code, but not enough info to identify
> > and evaluate alternative implementation approaches, to identify
> > possible future extensions, etc.
> > 
> > The patchset does appear to add significant additional overhead to hot
> > code paths when mm_has_notifiers(mm).  Please let's update the
> > changelog to address this rather important concern.  How significant is
> > the impact on such mm's, how common are such mm's now and in the
> > future, should we (for example) look at short-circuiting
> > __mmu_notifier_invalidate_range() if none of the registered notifiers
> > implement ->invalidate_range(), etc.
> 
> So one might feel like just completely removing the range_start()/end()
> from the mmu_notifier and stick to this one callback but it will not work
> with other hardware like the one i am doing HMM patchset for (i send it
> again a couple weeks ago https://lkml.org/lkml/2014/8/29/423).
> 
> Right now there are very few user of the mmu_notifier, SGI, xen, KVM and
> sadly recently the GPU folks (which i am part of too). So as the GPU are
> starting to use it we will see a lot more application going through the
> mmu_notifier callback. Yet you do want to leverage the hw like GPU and
> you do want to use the same address space on the GPU as on the CPU and
> thus you do want to share or at least keep synchronize the GPU view of
> the CPU page table.
> 
> Right now, for IOMMUv2 this means adding this callback, for device that
> do not rely on ATS/PASID PCIE extension this means something like HMM.
> Also note that HMM is a superset of IOMMUv2 as it could be use at the
> same time and provide more feature mainly allowing migrating some page
> to device local memory for performances purposes.

OK, but none of that addressed my question regarding the performance
cost.

> I think this sumup all motivation behind this patchset and also behind
> my other patchset. As usual i am happy to discuss alternative way to do
> things but i think that the path of least disruption from current code
> is the one implemented by those patchset.

"least disruption" is nice, but "good implementation" is better.  In
other words I'd prefer the more complex implementation if the end
result is better.  Depending on the magnitudes of "complex" and
"better" :) Two years from now (which isn't very long), I don't think
we'll regret having chosen the better implementation.

Does this patchset make compromises to achieve low disruption?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>