Hi Ingo,

On Tue, Apr 06, 2010 at 11:08:13AM +0200, Ingo Molnar wrote:
> The goal of Andrea's and Mel's patch-set, to make this 'final
> performance boost' more practical seems like a valid technical goal.

The integration in my current git tree (#19+):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
(later: git fetch; git checkout -f origin/master)

is working great and runs rock solid after the last integration bugfix
in migrate.c, enjoy! ;)

This is on my workstation, after building a ton of packages (including
javac binaries and all sorts of other random stuff), lots of kernels,
running mutt on large maildir folders, and running lots of ebuilds,
which are super heavy in vfs terms.

# free
             total       used       free     shared    buffers     cached
Mem:       3923408    2536380    1387028          0     482656    1194228
-/+ buffers/cache:     859496    3063912
Swap:      4200960        788    4200172
# uptime
 20:09:50 up 1 day, 13:19, 11 users,  load average: 0.00, 0.00, 0.00
# cat /proc/buddyinfo /proc/extfrag_index /proc/unusable_index
Node 0, zone      DMA      4      2      3      2      2      0      1      0      1      1      3
Node 0, zone    DMA32  10402  32864  10477   3729   2154   1156    471    136     22     50     41
Node 0, zone   Normal    196    155     40     21     16      7      4      1      0      2      0
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000  0.992
Node 0, zone      DMA  0.000  0.001  0.002  0.005  0.009  0.017  0.017  0.033  0.033  0.097  0.226
Node 0, zone    DMA32  0.000  0.030  0.223  0.347  0.434  0.536  0.644  0.733  0.784  0.801  0.876
Node 0, zone   Normal  0.000  0.072  0.185  0.244  0.306  0.400  0.482  0.576  0.623  0.623  1.000
# time echo 3 > /proc/sys/vm/drop_caches

real    0m0.989s
user    0m0.000s
sys     0m0.984s
# time echo > /proc/sys/vm/compact_memory

real    0m0.195s
user    0m0.000s
sys     0m0.124s
# cat /proc/buddyinfo /proc/extfrag_index /proc/unusable_index
Node 0, zone      DMA      4      2      3      2      2      0      1      0      1      1      3
Node 0, zone    DMA32   1632   1444   1336   1065    748    449    229    128     59     50    685
Node 0, zone   Normal   1046    783    552    367    261    176    116     82     50     43     15
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone      DMA  0.000  0.001  0.002  0.005  0.009  0.017  0.017  0.033  0.033  0.097  0.226
Node 0, zone    DMA32  0.000  0.001  0.005  0.012  0.022  0.037  0.054  0.072  0.092  0.111  0.142
Node 0, zone   Normal  0.000  0.012  0.030  0.056  0.090  0.139  0.205  0.291  0.414  0.563  0.820
# free
             total       used       free     shared    buffers     cached
Mem:       3923408     295240    3628168          0       4636      23192
-/+ buffers/cache:     267412    3655996
Swap:      4200960        788    4200172
# grep Anon /proc/meminfo
AnonPages:        210472 kB
AnonHugePages:    102400 kB

(AnonPages now includes AnonHugePages, for backwards compatibility;
sorry for not having done it that way earlier. So ~50% of the anon ram
is in hugepages.)

MB of hugepages before drop_caches+compact_memory:

>>> (41)*4+(52)*2
268

MB of hugepages after drop_caches+compact_memory:

>>> (685+15)*4+(50+43)*2
2986

Total ram free: 3543 MB. So 84% of the RAM is unaffected by unmovable
stuff, after heavy vfs slab load for about two days.
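As an aside, the hugepage figures above can be recomputed straight from
procfs. A minimal sketch, assuming 4k base pages and the 11-order
buddyinfo layout shown above (so the order-9 column is 2M blocks and the
order-10 column is 4M blocks); note the first command also counts the
DMA zone, which the arithmetic above leaves out:

# awk '{ mb += $(NF-1)*2 + $NF*4 } END { print mb " MB in >=2M blocks" }' /proc/buddyinfo
# awk '/^AnonHugePages/ {h=$2} /^AnonPages/ {a=$2} END {printf "%d%% of anon ram in hugepages\n", 100*h/a}' /proc/meminfo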
On my laptop I got a huge swap storm while I was away that killed
kdeinit4 with the oom killer (I found the kdm4 login screen again when I
got back). The swap storm supposedly split all hugepages, and yet after
a while I got all the hugepages back:

# grep Anon /proc/meminfo
AnonPages:        767680 kB
AnonHugePages:    395264 kB
# uptime
 20:33:33 up 1 day, 13:45, 9 users,  load average: 0.00, 0.00, 0.00
# dmesg|grep kill
Out of memory: kill process 8869 (kdeinit4) score 320362 or a child

(That's 50% of the ram in hugepages, with 400M more of hugepages
immediately available after invoking drop_caches/compact_memory manually
through the two sysctls.)

And if this isn't enough, kernelcore= can also provide an even stronger
guarantee, preventing unmovable stuff from spilling over and shrinking
freeable slab before it's too late.

The cache dropping would normally be run by try_to_free_pages
internally, interleaved with the try_to_compact_pages calls of course;
doing it manually here just shows the full potential of
set_recommended_min_free_kbytes (run automatically in-kernel at
late_initcall unless you boot with transparent_hugepage=0) plus memory
compaction, on top of the already compound-aware try_to_free_pages (in
addition to the movable/unmovable order fallback provided by
set_recommended_min_free_kbytes). And all this without using
kernelcore=, while allowing ebuild and other heavy users of unmovable
slab to grow as much as they want, with only 3G of ram.

The sluggishness of invoking alloc_pages with __GFP_WAIT from hugepage
page faults (synchronously, in direct reclaim) has also completely gone
away, after I tracked it down to lumpy reclaim, which I simply nuked.

This is already fully usable and works great: as Avi showed, it boosts
even a sort on the host by 6% (think of HPC applications), and soon I
hope to boost gcc on the host by 6% too (and by >15% in guests with
NPT/EPT) by extending vm_end in 2M chunks in glibc, at least for those
huge gcc builds taking >200M like translate.o of qemu-kvm... (So I hope
gcc running in a KVM guest, thanks to EPT/NPT, will soon run faster than
on a mainline kernel without transparent hugepages on bare metal.)

Now I'll add NUMA awareness by introducing alloc_pages_vma, the last
relevant bit, and make a #20 release... Then we may want to extend smaps
to show hugepages per process, instead of only globally in
/proc/meminfo.

The only tuning I might recommend to people benchmarking on top of the
current aa.git is to compare workloads under:

echo always > /sys/kernel/mm/transparent_hugepage/defrag # default setting at boot
echo never > /sys/kernel/mm/transparent_hugepage/defrag

and also to speed up khugepaged by decreasing
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
(which works around vm_end not yet being extended in 2M chunks).
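For instance, a minimal before/after comparison could look like the
following (./workload is just a hypothetical stand-in for whatever is
being benchmarked, and 100 msec is only an example value; lower means
khugepaged rescans and collapses ranges sooner):

# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# time ./workload
# echo always > /sys/kernel/mm/transparent_hugepage/defrag
# time ./workload
# echo 100 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs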
There's also a sysctl, /proc/sys/vm/extfrag_threshold, that allows
tuning the aggressiveness of memory compaction, but I wouldn't twiddle
with it: supposedly it'll go away, replaced by a future
exponential-backoff logic that interleaves the
try_to_compact_pages/try_to_free_pages calls optimally and more
dynamically than the sysctl can (discussion on linux-mm). But that's not
a huge priority at the moment; it already works great like this, and it
absolutely never becomes sluggish and is always responsive since I nuked
lumpy reclaim. The half-jiffy average wait time is definitely not
necessary, and it would be lost in the noise compared to the major
problem we had in calling try_to_free_pages with order = 9 and
__GFP_WAIT.

> In fact the whole maintenance thought process seems somewhat similar
> to the TSO situation: the networking folks first rejected TSO based on
> complexity arguments, but then was embraced after some time.

Full agreement!

I think everyone wants transparent hugepages. The only complaint I have
heard so far is from Christoph, who has a slight preference for not
introducing split_huge_page and instead going full hugepage everywhere,
with native support in gup immediately: GUP would only return head
pages, and every caller would have to check PageTransHuge on them to see
whether a page is huge or not. Changing several hundred drivers in one
go, with native swapping of hugepage-backed swapcache from day one
(which means the pagecache would also have to deal with hugepages
immediately), is possible too, but I think this more gradual approach is
easier to keep under control; Rome wasn't built in a day. Surely at a
later stage I want tmpfs backed by hugepages too, at least, and maybe
pagecache, but that doesn't need to happen immediately.

We also have to keep in mind that on huge systems PAGE_SIZE should
eventually become 2M, and those systems will then be able to take
advantage of transparent hugepages for the 1G pud_trans_huge, which will
make HPC even faster. Anyway, nothing prevents us from eventually taking
Christoph's long-term direction while starting self-contained. To me
what is relevant is that everyone in the VM camp seems to want
transparent hugepages in some shape or form, no matter the design,
because of the roughly linear speedup they provide to everything running
on top of them on bare metal (and a more-than-linear cumulative speedup
with nested pagetables, for obvious reasons).

Thanks,
Andrea