On Mon, 9 Jul 2018 17:50:03 -0700 (PDT) David Rientjes <rientjes@xxxxxxxxxx> wrote:

> When perf profiling a wide variety of different workloads, it was found
> that vmacache_find() had higher than expected cost: up to 0.08% of cpu
> utilization in some cases.  This was found to rival other core VM
> functions such as alloc_pages_vma() with thp enabled and default
> mempolicy, and the conditionals in __get_vma_policy().
>
> VMACACHE_HASH() determines which of the four per-task_struct slots a vma
> is cached for a particular address.  This currently depends on the pfn,
> so pfn 5212 occupies a different vmacache slot than its neighboring
> pfn 5213.
>
> vmacache_find() iterates through all four of current's vmacache slots
> when looking up an address.  Hashing based on pfn, an address has
> ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
> about 25%, *if* the vma is cached.
>
> This patch hashes an address by its pmd instead of pte to optimize for
> workloads with good spatial locality.  This results in a higher
> probability of vmas being cached in the first slot that is checked:
> normally ~70% on the same workloads instead of 25%.

Was the improvement quantifiable?

I'm surprised: that little array will all be in CPU cache and that loop
should execute pretty quickly?

If it's *that* sensitive then let's zap the no-longer-needed WARN_ON, and
we could hide all the event counting behind some developer-only ifdef.

Did you consider LRU-sorting the array instead?