Re: [PATCH 0/3] TLB flush multiple pages per IPI v5

Dave Hansen <dave.hansen@xxxxxxxxx> · Tue, 09 Jun 2015 08:34:35 -0700

On 06/09/2015 05:43 AM, Ingo Molnar wrote:
> I cited very real numbers about the direct costs of TLB flushes, and plausible 
> speculation about why the indirect costs are low on the achitecture you are trying 
> to modify here.

We should be careful to extrapolate what the real-world cost of a TLB
flush is from the cost of running a kernel function in a loop.

Let's look at what got measured:

> +static char tlb_flush_target[PAGE_SIZE] __aligned(4096);
> +static void fn_flush_tlb_one(void)
> +{
> +	unsigned long addr = (unsigned long)&tlb_flush_target;
> +
> +	tlb_flush_target[0]++;
> +	__flush_tlb_one(addr);
> +}

So we've got an increment of a variable in kernel memory (which is
almost surely in the L1), then we flush that memory location, and repeat
the increment.

I assume the increment is so that the __flush_tlb_one() has some "real"
work to do and is not just flushing an address which is not in the TLB.
 This is almost certainly a departure from workloads like Mel is
addressing where we (try to) flush pages used long ago that will
hopefully *not* be in the TLB.

But, that unfortunately means that we're measuring a TLB _miss_ here in
addition to the flush.  A TLB miss shouldn't be *that* expensive, right?
 The SDM says: "INVLPG also invalidates all entries in all
paging-structure caches ... regardless of the linear addresses to which
they correspond."  Ugh, so the TLB refill has to also refill the paging
structure caches.  At least the page tables will be in the L1.

Since "tlb_flush_target" is in kernel mapping, you might also be
shooting down the TLB entry for kernel text, or who knows what else.
The TLB entry might be 1G or 2M which might never be in the STLB
(second-level TLB), which could have *VERY* different behavior than a 4k
flush or a flush of an entry in the first-level TLB.

I'm not sure that these loop-style tests are particularly valuable, but
if we're going to do it, I think we should consider:
1. We need to separate the TLB fill portion from the flush and not
   measure any part of a fill along with the flush
2. We should measure flushing of ascending, adjacent virtual addresses
   mapped with 4k pages since that is the normal case.  Perhaps
   vmalloc(16MB) or something.
3. We should flush a mix of virtual addresses that are in and out of the
   TLB.
4. To measure instruction (as opposed to instruction+software)
   overhead, use __flush_tlb_single(), not __flush_tlb_one()

P.S.  I think I'm responsible for it, but we should probably also move
the count_vm_tlb_event() to outside the loop in flush_tlb_mm_range().
invlpg is not a "normal" instruction and could potentially increase the
overhead of incrementing the counter.  But, I guess the kernel mappings
_should_ stay in the TLB over an invlpg and shouldn't pay any cost to be
refilled in to the TLB despite the paging-structure caches going away.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>