Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree

Nicholas Piggin <npiggin@xxxxxxxxx> · Fri, 31 May 2019 12:46:56 +1000

Peter Zijlstra's on May 28, 2019 12:25 am:
> On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
>> On 5/27/19 4:31 PM, Peter Zijlstra wrote:
>> > On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
>> > > --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
>> > > +++ a/mm/mmu_gather.c
>> > > @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
>> > >   {
>> > >   	/*
>> > >   	 * If there are parallel threads are doing PTE changes on same range
>> > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
>> > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> > > -	 * forcefully if we detect parallel PTE batching threads.
>> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
>> > > +	 * if we detect parallel PTE batching threads.
>> > > +	 *
>> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>> > > +	 * needs force flush everything in the given range. Otherwise this
>> > > +	 * may result in having stale TLB entries for some architectures,
>> > > +	 * e.g. aarch64, that could specify flush what level TLB.
>> > >   	 */
>> > >   	if (mm_tlb_flush_nested(tlb->mm)) {
>> > > +		/*
>> > > +		 * The aarch64 yields better performance with fullmm by
>> > > +		 * avoiding multiple CPUs spamming TLBI messages at the
>> > > +		 * same time.
>> > > +		 *
>> > > +		 * On x86 non-fullmm doesn't yield significant difference
>> > > +		 * against fullmm.
>> > > +		 */
>> > > +		tlb->fullmm = 1;
>> > >   		__tlb_reset_range(tlb);
>> > > -		__tlb_adjust_range(tlb, start, end - start);
>> > > +		tlb->freed_tables = 1;
>> > >   	}
>> > >   	tlb_flush_mmu(tlb);
>> > 
>> > Nick, Aneesh, can we now do this?

Sorry I meant to get to that but forgot.

>> > 
>> > ---
>> > 
>> > diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> > index 4d841369399f..8d28b83914cb 100644
>> > --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> > +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> > @@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
>> >   	 */
>> >   	if (tlb->fullmm) {
>> >   		__flush_all_mm(mm, true);
>> > -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
>> > -	} else if (mm_tlb_flush_nested(mm)) {
>> > -		/*
>> > -		 * If there is a concurrent invalidation that is clearing ptes,
>> > -		 * then it's possible this invalidation will miss one of those
>> > -		 * cleared ptes and miss flushing the TLB. If this invalidate
>> > -		 * returns before the other one flushes TLBs, that can result
>> > -		 * in it returning while there are still valid TLBs inside the
>> > -		 * range to be invalidated.
>> > -		 *
>> > -		 * See mm/memory.c:tlb_finish_mmu() for more details.
>> > -		 *
>> > -		 * The solution to this is ensure the entire range is always
>> > -		 * flushed here. The problem for powerpc is that the flushes
>> > -		 * are page size specific, so this "forced flush" would not
>> > -		 * do the right thing if there are a mix of page sizes in
>> > -		 * the range to be invalidated. So use __flush_tlb_range
>> > -		 * which invalidates all possible page sizes in the range.
>> > -		 *
>> > -		 * PWC flush probably is not be required because the core code
>> > -		 * shouldn't free page tables in this path, but accounting
>> > -		 * for the possibility makes us a bit more robust.
>> > -		 *
>> > -		 * need_flush_all is an uncommon case because page table
>> > -		 * teardown should be done with exclusive locks held (but
>> > -		 * after locks are dropped another invalidate could come
>> > -		 * in), it could be optimized further if necessary.
>> > -		 */
>> > -		if (!tlb->need_flush_all)
>> > -			__radix__flush_tlb_range(mm, start, end, true);
>> > -		else
>> > -			radix__flush_all_mm(mm);
>> > -#endif
>> >   	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
>> >   		if (!tlb->need_flush_all)
>> >   			radix__flush_tlb_mm(mm);
>> > 
>> 
>> 
>> I guess we can revert most of the commit
>> 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush
>> multiple page sizes? . But should we evaluate the performance impact of that
>> fullmm flush on ppc64?
> 
> Maybe, but given the patch that went into -mm, PPC will never hit that
> branch I killed anymore -- and that really shouldn't be in architecture
> code anyway.

Yeah well if mm/ does this then sure it's dead and can go.

I don't think it's very nice to set fullmm and freed_tables for this 
case though. Is this concurrent zapping an important fast path? It
must have been, in order to justify all this complexity to the mm, so
we don't want to tie this boat anchor to it AFAIKS?

Is the problem just that the freed page tables flags get cleared by
__tlb_reset_range()? Why not just remove that then, so the bits are
set properly for the munmap?

> Also; as I noted last time: __radix___flush_tlb_range() and
> __radix__flush_tlb_range_psize() look similar enough that they might
> want to be a single function (and instead of @flush_all_sizes, have it
> take @gflush, @hflush, @flush and @pwc).

Yeah, it could possibly use a cleanup pass. I have a few patches
sitting around to make a few more optimisations around the place,
was going to look at some refactoring after that.

Thanks,
Nick