Excerpts from Eric Dumazet's message of August 22, 2020 1:38 am:
>
> On 8/21/20 8:12 AM, Nicholas Piggin wrote:
>> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
>> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
>> supports PMD sized vmap mappings.
>>
>> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
>> larger, and fall back to small pages if that was unsuccessful.
>>
>> Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
>> pages, because not all callers expect this (e.g., module allocations vs
>> strict module rwx).
>>
>> This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
>> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
>>
>> This can result in more internal fragmentation and memory overhead for a
>> given allocation, an option nohugevmalloc is added to disable at boot.
>>
>>
>
> Thanks for working on this stuff, I tried something similar in the past,
> but could not really do more than a hack.
> ( https://lkml.org/lkml/2016/12/21/285 )

Oh nice. It might be possible to do some ideas from your patch still.
Higher order pages smaller than PMD size, or the memory policy stuff,
perhaps.

> Note that __init alloc_large_system_hash() is used at boot time,
> when NUMA policy is spreading allocations over all NUMA nodes.
>
> This means that on a dual node system, a hash table should be 50/50 spread.
>
> With your patch, if a hashtable is exactly the size of one huge page,
> the location of this hashtable will be not balanced, this might have some
> unwanted impact.

In that case it shouldn't because it divides by the number of nodes,
but it will in general have a bit larger granularity in balancing than
smaller pages of course.

There's probably a better way to size these important hashes on NUMA.
I suspect that most of the time, on a NUMA machine, you would actually
prefer to use large pages now, even if it means taking up to 2MB more
memory per node per hash. It's not a great amount, and the allocation
size is rather arbitrary anyway.

Thanks,
Nick
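
For illustration only, the "divides by the number of nodes" reasoning above
can be modelled in a few lines of userspace C. This is a sketch under stated
assumptions, not the code from the patch: the function names are
hypothetical, the 4 KiB and 2 MiB page sizes are assumed, and the rule that
huge pages are chosen only when each node's share reaches one huge page is
an interpretation of the reply.

```c
#include <assert.h>

/* Assumed sizes for illustration: 4 KiB base pages, 2 MiB PMD-sized pages. */
#define SMALL_PAGE_SIZE (4UL * 1024)
#define HUGE_PAGE_SIZE  (2UL * 1024 * 1024)

/* Round size up to a multiple of align (align must be a power of two). */
static unsigned long round_up_pow2(unsigned long size, unsigned long align)
{
	return (size + align - 1) & ~(align - 1);
}

/*
 * Hypothetical model: pick the page size for an allocation that is
 * interleaved over `nodes` NUMA nodes. Huge pages are chosen only when
 * each node's share of the allocation is at least one huge page.
 */
static unsigned long pick_page_size(unsigned long size, unsigned int nodes)
{
	assert(nodes > 0);
	return (size / nodes >= HUGE_PAGE_SIZE) ? HUGE_PAGE_SIZE
						: SMALL_PAGE_SIZE;
}

/* Bytes added by rounding the allocation up to the chosen page size. */
static unsigned long rounding_overhead(unsigned long size, unsigned int nodes)
{
	unsigned long page = pick_page_size(size, nodes);

	return round_up_pow2(size, page) - size;
}
```

Under this model, a 2 MiB hash on a dual node system gives each node a 1 MiB
share, so pick_page_size() stays with small pages and the table can still be
spread 50/50. A 5 MiB allocation on the same system would use huge pages and
be rounded up to 6 MiB, so rounding_overhead() reports 1 MiB; the rounding
cost of a single allocation is always less than one huge page.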