Re: [PATCH] mm: Flush the TLB for a single address in a huge page

Catalin Marinas <catalin.marinas@xxxxxxx> · Thu, 23 Jul 2015 17:16:45 +0100

On Thu, Jul 23, 2015 at 03:41:24PM +0100, Dave Hansen wrote:
> On 07/23/2015 07:13 AM, Andrea Arcangeli wrote:
> > On Thu, Jul 23, 2015 at 11:49:38AM +0100, Catalin Marinas wrote:
> >> On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
> >>> On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> >>>> You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> >>>> 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> >>>> always. I would say a single page TLB flush is more efficient than a
> >>>> whole TLB flush but I'm not familiar enough with x86.
> >>>
> >>> The last time I looked, the instruction to invalidate a single page is
> >>> more expensive than the instruction to flush the entire TLB. 
> >>
> >> I was thinking of the overall cost of re-populating the TLB after being
> >> nuked rather than the instruction itself.
> > 
> > Unless I'm not aware about timing differences in flushing 2MB TLB
> > entries vs flushing 4kb TLB entries with invlpg, the benchmarks that
> > have been run to tune the optimal tlb_single_page_flush_ceiling value,
> > should already guarantee us that this is a valid optimization (as we
> > just got one entry, we're not even close to the 33 ceiling that makes
> > it more a grey area).
> 
> We had a discussion about this a few weeks ago:
> 
> 	https://lkml.org/lkml/2015/6/25/666
> 
> The argument is that the CPU is so good at refilling the TLB that it
> rarely waits on it, so the "cost" can be very very low.

Interesting thread. I can see from Ingo's benchmarks that invlpg is much
more expensive than the cr3 write but I can't really comment on the
refill cost (it may be small with page table caching in L1/L2). The
problem with small/targeted benchmarks is that you don't see the overall
impact.

On ARM, most recent CPUs can cache intermediate page table levels in the
TLB (usually as VA->pte translation). ARM64 introduces a new TLB
flushing instruction that only touches the last level (pte, huge pmd).
In theory this should be cheaper overall since the CPU doesn't need to
refill intermediate levels. In practice, it's probably lost in the
noise.

Anyway, if you want to keep the option of a full TLB flush for x86 on
huge pages, I'm happy to repost a v2 with a separate
flush_tlb_pmd_huge_page that arch code can define as it sees fit.

-- 
Catalin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href