Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree

Hi Nick,

On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
> Peter Zijlstra wrote on May 31, 2019 at 7:49 pm:
> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> >> I don't think it's very nice to set fullmm and freed_tables for this 
> >> case though. Is this concurrent zapping an important fast path? It
> >> must have been, in order to justify all this complexity to the mm, so
> >> we don't want to tie this boat anchor to it AFAIKS?
> > 
> > I'm not convinced it's an important fast path; afaict it is an
> > unfortunate correctness issue caused by allowing concurrent frees.
> 
> I mean -- concurrent freeing was an important fastpath, right?
> And concurrent freeing means that you hit this case. So this
> case itself should be important too.

I honestly don't think we (Peter and I) know. Our first involvement with
this was because TLBs were found to contain stale user entries:

https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@xxxxxxxxxx/

so the initial work to support the concurrent freeing was done separately
and, I assume, motivated by some real workloads. I would also very much
like to know more about that, since nothing remotely realistic has surfaced
in this discussion, other than some historical glibc thing which has long
since been fixed.
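
For reference, my understanding is that the hunk being debated amounts to
this: when tlb_finish_mmu() detects a concurrent flusher on the same mm, it
gives up on precise tracking and flushes everything. Below is a small
userspace model of that logic -- a sketch only, with a cut-down struct and
stand-ins for mm_tlb_flush_nested() and __tlb_reset_range(); the patch in
-mm has the real code:

#include <stdbool.h>
#include <stdio.h>

/* Cut-down model of struct mmu_gather; not the kernel definition. */
struct mmu_gather {
	unsigned long start, end;
	bool fullmm;		/* flush the whole address space */
	bool freed_tables;	/* page-table pages were freed */
	int flush_pending;	/* stands in for mm->tlb_flush_pending */
};

/* Stands in for mm_tlb_flush_nested(): another zap ran concurrently. */
static bool tlb_flush_nested(const struct mmu_gather *tlb)
{
	return tlb->flush_pending > 1;
}

/* Stands in for __tlb_reset_range(); note it clears freed_tables too,
 * which is why just widening the flush range was not enough. */
static void tlb_reset_range(struct mmu_gather *tlb)
{
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->freed_tables = false;
}

static void tlb_finish(struct mmu_gather *tlb)
{
	if (tlb_flush_nested(tlb)) {
		/* We can't tell what the concurrent flushers skipped,
		 * so flush the world, page tables included. */
		tlb->fullmm = true;
		tlb_reset_range(tlb);
		tlb->freed_tables = true;
	}
	printf("flush: fullmm=%d freed_tables=%d\n",
	       tlb->fullmm, tlb->freed_tables);
}

int main(void)
{
	struct mmu_gather tlb = {
		.start = 0x1000, .end = 0x5000, .flush_pending = 2,
	};

	tlb_finish(&tlb);	/* prints "flush: fullmm=1 freed_tables=1" */
	return 0;
}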

> >> Is the problem just that the freed page tables flags get cleared by
> >> __tlb_reset_range()? Why not just remove that then, so the bits are
> >> set properly for the munmap?
> > 
> > That's insufficient; as argued in my initial suggestion:
> > 
> >   https://lkml.kernel.org/r/20190509103813.GP2589@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > 
> > Since we don't know what was flushed by the concurrent flushes, we must
> > flush all state (page sizes, tables, etc.).
> 
> Page tables should not be concurrently freed I think. Just don't clear
> those page table free flags and it should be okay. Page sizes yes,
> but we accounted for that in the arch code. I could see reason to
> add a flag to the gather struct like "concurrent_free" and set that
> from the generic code, which the arch has to take care of.

I think you're correct that two CPUs cannot free the page tables
concurrently (I misunderstood this initially), although I also think
there may be some subtle issues if tlb->freed_tables is not set,
depending on the architecture. Roughly speaking, if one CPU is clearing
a PMD as part of munmap() and another CPU in madvise() does only last-level
TLB invalidation, then I think there's the potential for the invalidation
to be ineffective if observing a cleared PMD doesn't imply that the last
level has been unmapped from the perspective of the page-table walker.
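
To make that concrete, here is a rough model of the arch-side decision
(the helper names here are invented for illustration; arm64 makes a
similar leaf-only vs. all-levels distinction in its TLBI instructions):

#include <stdbool.h>
#include <stdio.h>

struct mmu_gather {
	unsigned long start, end;
	bool freed_tables;
};

/* Invalidates leaf entries only; cached intermediate (walk-cache)
 * translations for the PMD/PUD levels survive. */
static void invalidate_last_level(unsigned long start, unsigned long end)
{
	printf("TLBI leaf-only  [%#lx, %#lx)\n", start, end);
}

/* Also drops cached intermediate translations. */
static void invalidate_all_levels(unsigned long start, unsigned long end)
{
	printf("TLBI all-levels [%#lx, %#lx)\n", start, end);
}

static void arch_tlb_flush(const struct mmu_gather *tlb)
{
	/* If freed_tables was wrongly cleared (e.g. by __tlb_reset_range()),
	 * we take the leaf-only path even though a concurrent munmap() may
	 * have torn down the PMD, leaving a stale walk-cache entry behind. */
	if (tlb->freed_tables)
		invalidate_all_levels(tlb->start, tlb->end);
	else
		invalidate_last_level(tlb->start, tlb->end);
}

int main(void)
{
	struct mmu_gather tlb = {
		.start = 0x1000, .end = 0x5000, .freed_tables = false,
	};

	arch_tlb_flush(&tlb);	/* leaf-only: the risky case above */
	return 0;
}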

> > But it looks like benchmarks (for the one test-case we have) seem to
> > favour flushing the world over flushing a smaller range.
> 
> Testing on a 16MB unmap is possibly not a good benchmark; I didn't run
> it exactly, but it looks likely to go beyond the range-flush threshold
> and flush the entire PID anyway.

If we can get a better idea of what a "good benchmark" might look like (i.e.
something that is representative of the cases in which real workloads are
likely to trigger this path) then we can definitely try to optimise around
that.
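
As a strawman, I'd expect the interesting cases to be shaped something like
the toy below: several threads zapping disjoint chunks of one mapping with
MADV_DONTNEED under mmap_sem held for read, so their deferred flushes can
overlap. This is purely illustrative -- it is not the benchmark from the
original report, and the sizes and thread count are made up:

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION	 (64UL << 20)	/* 64MB anonymous mapping */
#define NTHREADS 4

static char *region;

static void *zap_chunk(void *arg)
{
	long id = (long)arg;
	size_t chunk = REGION / NTHREADS;
	char *base = region + id * chunk;

	/* Fault the pages in, then zap them; each madvise() batches a
	 * deferred TLB flush, so the flushes can race across threads. */
	memset(base, 1, chunk);
	madvise(base, chunk, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	long i;

	region = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, zap_chunk, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	munmap(region, REGION);
	return 0;
}

How the chunk size compares with any arch range-flush threshold would
presumably matter here too, per Nick's point above.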

In the meantime, I would really like to see this patch land in mainline
since it fixes a regression.

Will


