Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree

Hi Nick,

On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
> Will Deacon's on June 3, 2019 8:30 pm:
> > On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
> >> Peter Zijlstra's on May 31, 2019 7:49 pm:
> >> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> >> >> I don't think it's very nice to set fullmm and freed_tables for this 
> >> >> case though. Is this concurrent zapping an important fast path? It
> >> >> must have been, in order to justify all this complexity to the mm, so
> >> >> we don't want to tie this boat anchor to it AFAIKS?
> >> > 
> >> > I'm not convinced it's an important fast path; afaict it is an
> >> > unfortunate correctness issue caused by allowing concurrent frees.
> >> 
> >> I mean -- concurrent freeing was an important fastpath, right?
> >> And concurrent freeing means that you hit this case. So this
> >> case itself should be important too.
> > 
> > I honestly don't think we (Peter and I) know. Our first involvement with
> > this was because TLBs were found to contain stale user entries:
> > 
> > https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@xxxxxxxxxx/
> > 
> > so the initial work to support the concurrent freeing was done separately
> > and, I assume, motivated by some real workloads. I would also very much
> > like to know more about that, since nothing remotely realistic has surfaced
> > in this discussion, other than some historical glibc thing which has long
> > since been fixed.
> 
> Well, it seems like it is important. While the complexity is carried
> in the mm, we should not skimp on this last small piece.

As I say, I really don't know. But yes, if we can do something better we
should.
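
For reference, the hunk under discussion replaces the __tlb_reset_range()
call in the nested-flush path of tlb_finish_mmu() with a forced full
flush. Roughly (paraphrasing the -mm patch from memory, so the details
may differ):

        if (mm_tlb_flush_nested(tlb->mm)) {
                /*
                 * Another thread raced with us and batched PTE changes
                 * on the same range; since we can't tell what it
                 * flushed, flush conservatively: the whole mm,
                 * including page-table pages and all page sizes.
                 */
                tlb->fullmm = 1;
                __tlb_reset_range(tlb);
                tlb->freed_tables = 1;
        }

        tlb_flush_mmu(tlb);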

> >> >> Is the problem just that the freed page-table flags get cleared by
> >> >> __tlb_reset_range()? Why not just remove that then, so the bits are
> >> >> set properly for the munmap?
> >> > 
> >> > That's insufficient; as argued in my initial suggestion:
> >> > 
> >> >   https://lkml.kernel.org/r/20190509103813.GP2589@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> >> > 
> >> > Since we don't know what was flushed by the concurrent flushes, we must
> >> > flush all state (page sizes, tables etc..).
> >> 
> >> Page tables should not be concurrently freed, I think. Just don't clear
> >> those page-table free flags and it should be okay. Page sizes, yes,
> >> but we accommodated that in the arch code. I could see reason to
> >> add a flag to the gather struct like "concurrent_free" and set that
> >> from the generic code, which the arch has to take care of.
> > 
> > I think you're correct that two CPUs cannot free the page tables
> > concurrently (I misunderstood this initially), although I also think
> > there may be some subtle issues if tlb->freed_tables is not set,
> > depending on the architecture. Roughly speaking, if one CPU is clearing
> > a PMD as part of munmap() and another CPU in madvise() does only last-level
> > TLB invalidation, then I think there's the potential for the invalidation
> > to be ineffective if observing a cleared PMD doesn't imply that the last
> > level has been unmapped from the perspective of the page-table walker.
> 
> That should not be the case, because the last-level table should have
> had all its entries cleared before the pointer to it was cleared.

The devil is in the detail here, and I think specifically it depends on
what you mean by "before". Does that mean a memory barrier, a special
page-table-walker barrier, a TLB invalidation, or ...?
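
To make that concrete, the interleaving I have in mind is roughly this
(illustrative pseudo-kernel code, not tied to any particular
architecture):

        /* CPU 0: munmap(), tearing down a last-level table */
        pte_clear(mm, addr, pte);        /* clear the PTE entries          */
        /* <-- what ordering is needed here?  smp_wmb()?  a dedicated
         *     page-table-walker barrier?  or nothing short of a TLBI?     */
        pmd_clear(pmd);                  /* clear the pointer to the table */
        tlb_remove_table(tlb, table);    /* queue the table for freeing    */

        /* CPU 1: concurrent madvise() over the same range */
        flush_tlb_range(vma, start, end);
        /* last-level-only invalidation, since this gather never freed a
         * table; whether that's enough depends on whether the walker can
         * still be using a stale cached copy of the old PMD.              */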

> So the page table walker could begin from the now-freed page table,
> but it would never instantiate a valid TLB entry from there. So a
> TLB invalidation would behave properly despite not flushing page
> tables.
> 
> Powerpc at least would want to avoid over flushing here, AFAIKS.

For arm64 it really depends on how often this hits. Simply not setting
tlb->freed_tables would also break things for us, because we have an
optimisation where we elide invalidation in the fullmm && !freed_tables
case, since this is indicative of the mm going away and so we simply
avoid reallocating its ASID.
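
Concretely, our tlb_flush() hook looks roughly like this (paraphrasing
arch/arm64/include/asm/tlb.h from memory, details trimmed):

        static inline void tlb_flush(struct mmu_gather *tlb)
        {
                struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
                bool last_level = !tlb->freed_tables;
                unsigned long stride = tlb_get_unmap_size(tlb);

                /*
                 * If we're tearing down the address space then we only
                 * need to nuke the walk-cache, since the ASID allocator
                 * won't hand out our ASID again without invalidating the
                 * whole TLB first.
                 */
                if (tlb->fullmm) {
                        if (!last_level)
                                flush_tlb_mm(tlb->mm);
                        return;
                }

                __flush_tlb_range(&vma, tlb->start, tlb->end, stride,
                                  last_level);
        }

So a forced flush that sets fullmm but leaves freed_tables clear would
do no invalidation at all here.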

> >> > But it looks like benchmarks (for the one test-case we have)
> >> > favour flushing the world over flushing a smaller range.
> >> 
> >> Testing on a 16MB unmap is possibly not a good benchmark; I didn't run
> >> it exactly, but it looks likely to go beyond the range-flush threshold
> >> and flush the entire PID anyway.
> > 
> > If we can get a better idea of what a "good benchmark" might look like (i.e.
> > something that is representative of the cases in which real workloads are
> > likely to trigger this path) then we can definitely try to optimise around
> > that.
> 
> Hard to say unfortunately. A smaller unmap range to start with, but
> even then, when you have a TLB over-flushing case, an unmap micro
> benchmark is not a great test, because you'd like to see the impact of
> other useful entries being flushed (e.g., you need an actual working
> set).

Right, sounds like somebody needs to do some better analysis than what's
been done so far.
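
Something along these lines is what I'd imagine (untested sketch; the
sizes, iteration count and 2MB chunk are picked out of thin air): two
threads concurrently zapping disjoint chunks of a shared mapping, so
the nested-flush path can actually be hit, while a third thread keeps a
working set hot so that over-invalidation costs something:

        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <time.h>

        #define WSS     (64UL << 20)    /* working set kept hot          */
        #define CHUNK   (2UL << 20)     /* region each zapper zaps       */
        #define ITERS   5000

        static volatile int stop;
        static char *wss, *zap_area;

        static void *toucher(void *arg)
        {
                /* Keep the working set resident so over-flushing hurts. */
                while (!stop)
                        for (size_t i = 0; i < WSS; i += 4096)
                                wss[i]++;
                return NULL;
        }

        static void *zapper(void *arg)
        {
                char *chunk = zap_area + (long)arg * CHUNK;

                for (int i = 0; i < ITERS; i++) {
                        memset(chunk, 1, CHUNK);              /* fault it in */
                        madvise(chunk, CHUNK, MADV_DONTNEED); /* zap again   */
                }
                return NULL;
        }

        int main(void)
        {
                pthread_t t, z0, z1;
                struct timespec a, b;

                wss = mmap(NULL, WSS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                zap_area = mmap(NULL, 2 * CHUNK, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (wss == MAP_FAILED || zap_area == MAP_FAILED)
                        return 1;
                memset(wss, 1, WSS);

                pthread_create(&t, NULL, toucher, NULL);
                clock_gettime(CLOCK_MONOTONIC, &a);
                pthread_create(&z0, NULL, zapper, (void *)0);
                pthread_create(&z1, NULL, zapper, (void *)1);
                pthread_join(z0, NULL);
                pthread_join(z1, NULL);
                clock_gettime(CLOCK_MONOTONIC, &b);
                stop = 1;
                pthread_join(t, NULL);

                printf("zap loop: %.3fs\n", (b.tv_sec - a.tv_sec) +
                       (b.tv_nsec - a.tv_nsec) / 1e9);
                return 0;
        }

Measuring the toucher's progress alongside the zap time would show how
much useful TLB state the forced flushes are throwing away.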

> > In the meantime, I would really like to see this patch land in mainline
> > since it fixes a regression.
> 
> Sorry I didn't provide input earlier. I would like to improve the fix, or
> at least add an option for archs to provide an optimised way to flush
> this case, so it would be nice not to fix archs this way and then have
> to change the fix significantly right away.

Please send patches ;)

> But the bug does need to be fixed, of course; if it needs more thought,
> maybe it's best to take this fix for the next release.

Agreed.

Will


