On Jun 25, 2015 04:48, "Ingo Molnar" <mingo@xxxxxxxxxx> wrote:
>
> - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the
> first byte in each
So that test is a bit unfair. From previous timing of Intel TLB fills, I can tell you that Intel is particularly good at doing adjacent entries.
That's independent of the fact that page tables have very good locality (if they are the radix tree type - the hashed page tables that ppc uses are shit). So when filling adjacent entries, you take the cache misses for the page tables only once, but even aside from that, Intel tends to do particularly well at the "next page" TLB fill case.
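To make the locality point concrete, here is a minimal user-space sketch of the index arithmetic. It assumes x86-64 radix page tables with 4K pages, 8-byte page table entries and 64-byte cache lines, so eight last-level entries share one cache line. The helper names (pte_line_id, pte_lines_touched) are made up for illustration and are not kernel APIs, and nothing here measures actual TLB fill cost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT     12                        /* 4K pages */
#define PTE_SIZE       8                         /* bytes per x86-64 page table entry */
#define CACHE_LINE     64                        /* assumed cache line size in bytes */
#define PTES_PER_LINE  (CACHE_LINE / PTE_SIZE)   /* 8 entries per cache line */

/* Virtual page number of an address. */
static uint64_t vpn(uint64_t vaddr)
{
	return vaddr >> PAGE_SHIFT;
}

/*
 * Identifier of the cache line that holds the last-level entry for vaddr.
 * A last-level table has 512 entries, a multiple of 8, so vpn / 8 names
 * both the table and the line within it. Two addresses with the same id
 * share one page-table cache line.
 */
static uint64_t pte_line_id(uint64_t vaddr)
{
	return vpn(vaddr) / PTES_PER_LINE;
}

/*
 * Count the distinct last-level page-table cache lines needed to fill
 * n pages starting at base, spaced stride_pages apart (hypothetical
 * helper). Addresses only increase, so counting id changes is enough.
 */
static size_t pte_lines_touched(uint64_t base, size_t n, uint64_t stride_pages)
{
	size_t lines = 0;
	uint64_t prev = 0;

	assert(n > 0 && stride_pages > 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t id = pte_line_id(base + ((i * stride_pages) << PAGE_SHIFT));

		if (i == 0 || id != prev)
			lines++;
		prev = id;
	}
	return lines;
}
```

Under those assumptions, four adjacent pages starting on an 8-page boundary need a single page-table cache line, while four pages spaced 8 or more pages apart need four. That is only the page-table cache-miss side of the difference; whatever the hardware does on top of that for the "next page" case is not modeled.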
Now, I think that's a reasonably common case, and I'm not saying that it's unfair to compare for that reason, but it does highlight the good case for TLB walking.
So I would suggest you highlight the bad case too: use invlpg to invalidate *one* TLB entry, and then walk four non-adjacent entries. And compare *that* to the full TLB flush.
Now, I happen to still believe in the full flush, but let's not pick benchmarks that might not show the advantages of the finer granularity.
Linus