Hi Linus,

On Mon, Apr 05, 2010 at 01:58:57PM -0700, Linus Torvalds wrote:
> What I'm asking for is this thing called "Does it actually work in
> REALITY". That's my point about "not just after a clean boot".
>
> Just to really hit the issue home, here's my current machine:
>
>   [root@i5 ~]# free
>                total       used       free     shared    buffers     cached
>   Mem:       8073864    1808488    6265376          0      75480    1018412
>   -/+ buffers/cache:     714596    7359268
>   Swap:     10207228      12848   10194380
>
> Look, I have absolutely _sh*tloads_ of memory, and I'm not using it.
> Really. I've got 8GB in that machine, it's just not been doing much more
> than a few "git pull"s and "make allyesconfig" runs to check the current
> kernel and so it's got over 6GB free.
>
> So I'm bound to have _tons_ of 2M pages, no?
>
> No. Lookie here:
>
>   [344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB
>   [344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB
>   [344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB
>
> just to help you parse that: this is a _lightly_ loaded machine. It's been
> up for about four days. And look at it.
>
> In case you can't read it, the relevant part is this part:
>
>   DMA:    .. 1*2048kB 3*4096kB
>   DMA32:  .. 1*2048kB 0*4096kB
>   Normal: .. 3*2048kB 0*4096kB
>
> there is just a _small handful_ of 2MB pages. Seriously. On a machine with
> 8 GB of RAM, and three quarters of it free, and there is just a couple of
> contiguous 2MB regions. Note, that's _MB_, not GB.

What I can provide is my current status so far on my workstation:

$ free
             total       used       free     shared    buffers     cached
Mem:       1923648    1410912     512736          0     332236     391000
-/+ buffers/cache:     687676    1235972
Swap:      4200960      14204    4186756
$ cat /proc/buddyinfo
Node 0, zone      DMA     46     34     30     12     16     11     10      5      0      1      0
Node 0, zone    DMA32     33    355    352    129     46   1307    751    225      9      1      0
$ uptime
 00:06:54 up 10 days,  5:10,  3 users,  load average: 0.00, 0.00, 0.00
$ grep Anon /proc/meminfo
AnonPages:         78036 kB
AnonHugePages:    100352 kB

And on my laptop:

$ free
             total       used       free     shared    buffers     cached
Mem:       3076948    1964136    1112812          0      91920     297212
-/+ buffers/cache:    1575004    1501944
Swap:      2939888      17668    2922220
$ cat /proc/buddyinfo
Node 0, zone      DMA     26      9      8      3      3      2      2      1      1      3      1
Node 0, zone    DMA32    840   2142   6455   5848   5156   2554    291     52     30      0      0
$ uptime
 00:08:21 up 17 days, 20:17,  5 users,  load average: 0.06, 0.01, 0.00
$ grep Anon /proc/meminfo
AnonPages:        856332 kB
AnonHugePages:    272384 kB

This is with:

$ cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
[yes] no

Currently the "defrag" sysfs control only toggles __GFP_WAIT on/off in
huge_memory.c (details in the patch with subject "transparent hugepage
core", in the alloc_hugepage() function). Toggling __GFP_WAIT is a joke
right now. The real deal to address your worry is first to run "hugeadm
--set-recommended-min_free_kbytes" and then to apply Mel's patches called
"memory compaction", which is a separate patchset. I'm the consumer,
Mel's the producer ;).

With virtual machines the host kernel doesn't need to live forever (it
has to be stable, but we can easily reboot it without the guests
noticing): we can migrate virtual machines to freshly booted new hosts,
avoiding the whole producer issue.
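To make the __GFP_WAIT point concrete, the allocation in alloc_hugepage()
boils down to roughly the following (a from-memory sketch, not the literal
code of the "transparent hugepage core" patch; the base gfp mask chosen
here is an assumption for illustration):

static struct page *alloc_hugepage(int defrag)
{
        /*
         * Plausible base mask for an anonymous compound page; the only
         * thing the sysfs "defrag" knob changes today is whether
         * __GFP_WAIT is set, i.e. whether the allocation is allowed to
         * sleep and enter reclaim at all.
         */
        gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_COMP;

        if (defrag)
                gfp_mask |= __GFP_WAIT;

        /* one 2M page on x86-64: order 9 */
        return alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
}

With or without __GFP_WAIT, if the buddy has no order-9 pages free the
allocation simply fails and the fault falls back to regular 4k pages,
which is why this knob alone doesn't address your concern.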
Furthermore, VMs are usually started for the first time at host boot
time, and we want as much memory as possible backed by hugepages in the
host. This is not to say that the producer isn't important or can't work;
Mel posted numbers that show it works, and we definitely want it to work.
I'm just trying to make the point that a good consumer of the plentiful
hugepages available at boot is useful even assuming the producer will
never work or never get in (not the real-life case we're dealing with!).
Initially we're going to take advantage of only the consumer in
production, exactly because it's already useful, even if we want to take
advantage of a smart runtime "producer" too as time goes on. Migrating
guests to produce hugepages isn't the ideal way for sure, and I'm very
confident that Mel's work is already filling that gap very nicely.

The VM itself (regardless of whether the consumer is hugetlbfs or
transparent hugepage support) is evolving towards being able to generate
an endless amount of hugepages (in 2M size; 1G is still unthinkable
because of the huge cost), as shown by the already-mainline-available
"hugeadm --set-recommended-min_free_kbytes". BTW, I think having this
10-liner algorithm in the userland hugeadm binary is wrong and it should
be a separate sysctl like
"echo 1 >/sys/kernel/vm/set-recommended-min_free_kbytes", but that's
offtopic and an implementation detail (I'll sketch below what that
10-liner computes)... This is just to show they are already addressing
that stuff for hugetlbfs. So I just created a better consumer for the
stuff they make an effort to produce anyway (i.e. 2M pages). The better
the consumer we have of it in the kernel, the more effort will be put
into the producer.

> And don't tell me that these things are easy to fix. Don't tell me that
> the current VM is quite clean and can be harmlessly extended to deal with
> this all. Just don't. Not when we currently have a totally unexplained
> regression in the VM from the last scalability thing we did.

Well, the risk of regression from the consumer is little if it's disabled
with sysfs, so it'd be trivial to localize if it caused any problem.
About memory compaction, I think we should limit the invocation of those
new VM algorithms to hugetlbfs and transparent hugepage support (and I
already created the sysfs controls to enable/disable those, so you can
run transparent hugepage support with or without the defrag feature). So
all of this can be turned off at runtime. You can run only the consumer,
both consumer and producer, or none (and if none, the risk of regression
should be zero). There's no point in ever defragging if there is no
consumer of 2M pages. khugepaged should be able to invoke memory
compaction comfortably in its background defrag job if khugepaged/defrag
is set to "yes".

I think worrying about the producer too much creates a chicken-and-egg
problem: without a heavy consumer in mainline, there's little point for
people to work on the producer. Note that creating a good consumer wasn't
an easy task; I did all I could to keep it self-contained and I think I
succeeded at that. As a result, my work created interest in improving the
producer on Mel's side. I am sure that if the consumer goes in, producing
the stuff will also happen without much problem.

My preferred merge order is to merge the consumer first. But then I'm not
entirely against the other order either. Merging both at the same time
looks to me like unnecessary complexity merged into the kernel at once,
and it'd make things less bisectable. But it wouldn't be impossible
either.
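For reference, what that 10-liner computes is conceptually along these
lines (an illustrative userland sketch of the
"hugeadm --set-recommended-min_free_kbytes" idea, not a copy of its code;
the constants would really be derived from the running kernel's pageblock
size, zone layout and migrate types):

#include <stdio.h>

/* Illustrative values for a typical x86-64 single-node box. */
#define PAGEBLOCK_KB    2048    /* one 2M pageblock expressed in kB */
#define NR_ZONES        3       /* DMA, DMA32, Normal */
#define MIGRATE_TYPES   3       /* unmovable, reclaimable, movable */

int main(void)
{
        unsigned long min_free_kbytes;

        /* keep a couple of free pageblocks per zone ... */
        min_free_kbytes = PAGEBLOCK_KB * NR_ZONES * 2;
        /* ... plus headroom so per-cpu lists of different migrate types
           don't end up stealing pages from each other's pageblocks */
        min_free_kbytes += PAGEBLOCK_KB * NR_ZONES *
                           MIGRATE_TYPES * MIGRATE_TYPES;

        printf("echo %lu > /proc/sys/vm/min_free_kbytes\n",
               min_free_kbytes);
        return 0;
}

The point being that keeping a little more memory free, at pageblock
granularity, makes anti-fragmentation's job vastly easier, and that
policy belongs in the kernel rather than in a userland binary.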
About the performance benefits, I posted some numbers on linux-mm, but
I'll collect them here (and this is after boot, with plenty of hugepages
available). As a side note, in this first part please note also the boost
in the page fault rate (but this is really only for curiosity, as it will
only happen when hugepages are immediately available in the buddy).

------------

hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
because with NPT/EPT they decrease the tlb-miss cacheline accesses from
24 to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest use this patch (though the guest
will limit the additional speedup to anonymous regions only for now...).
Even more important is that the tlb miss handler is much slower on a
NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario. So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there were
no significant speedup in the tlb-miss runtime).

[..]

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
        char *p = malloc(SIZE), *p2;
        struct timeval before, after;

        /* first touch: measures the page fault cost */
        gettimeofday(&before, NULL);
        memset(p, 0, SIZE);
        gettimeofday(&after, NULL);
        printf("memset page fault %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        /* already mapped memory: measures the tlb miss cost */
        gettimeofday(&before, NULL);
        memset(p, 0, SIZE);
        gettimeofday(&after, NULL);
        printf("memset tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        gettimeofday(&before, NULL);
        memset(p, 0, SIZE);
        gettimeofday(&after, NULL);
        printf("memset second tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        /* touch one byte per 4k page: one tlb miss per page */
        gettimeofday(&before, NULL);
        for (p2 = p; p2 < p+SIZE; p2 += 4096)
                *p2 = 0;
        gettimeofday(&after, NULL);
        printf("random access tlb miss %Lu\n",
               (after.tv_sec-before.tv_sec)*1000000UL +
               after.tv_usec-before.tv_usec);

        gettimeofday(&before, NULL);
        for (p2 = p; p2 < p+SIZE; p2 += 4096)
                *p2 = 0;
        gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ ------------- This is a more interesting benchmark of kernel compile and some random cpu bound dd command (not a microbenchmark like above): ----------- This is a kernel build in a 2.6.31 guest, on a 2.6.34-rc1 host. KVM run with "-drive cache=on,if=virtio,boot=on and -smp 4 -m 2g -vnc :0" (host has 4G of ram). CPU is Phenom (not II) with NPT (4 cores, 1 die). All reads are provided from host cache and cpu overhead of the I/O is reduced thanks to virtio. Workload is just a "make clean >/dev/null; time make -j20 >/dev/null". Results copied by hand because I logged through vnc. real 4m12.498s 14m28.106s 1m26.721s real 4m12.000s 14m27.850s 1m25.729s After the benchmark: grep Anon /proc/meminfo AnonPages: 121300 kB AnonHugePages: 1007616 kB cat /debugfs/kvm/largepages 2296 1.6G free in guest and 1.5free in host. Then on host: # echo never > /sys//kernel/mm/transparent_hugepage/enabled # echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled then I restart the VM and re-run the same workload: real 4m25.040s user 15m4.665s sys 1m50.519s real 4m29.653s user 15m8.637s sys 1m49.631s (guest kernel was not so recent and it had no transparent hugepage support because gcc normally won't take advantage of hugepages according to /proc/meminfo, so I made the comparison with a distro guest kernel with my usual .config I use in kvm guests) So guest compile the kernel 6% faster with hugepages and the results are trivially reproducible and stable enough (especially with hugepage enabled, without it varies from 4m24 sto 4m30s as I tried a few times more without hugepages in NTP when userland wasn't patched yet...). Below another test that takes advantage of hugepage in guest too, so running the same 2.6.34-rc1 with transparent hugepage support in both host and guest. (this really shows the power of KVM design, we boost the hypervisor and we get double boost for guest applications) Workload: time dd if=/dev/zero of=/dev/null bs=128M count=100 Host hugepage no guest: 3.898 Host hugepage guest hugepage: 3.966 (-1.17%) Host no hugepage no guest: 4.088 (-4.87%) Host hugepage guest no hugepage: 4.312 (-10.1%) Host no hugepage guest hugepage: 4.388 (-12.5%) Host no hugepage guest no hugepage: 4.425 (-13.5%) Workload: time dd if=/dev/zero of=/dev/null bs=4M count=1000 Host hugepage no guest: 1.207 Host hugepage guest hugepage: 1.245 (-3.14%) Host no hugepage no guest: 1.261 (-4.47%) Host no hugepage guest no hugepage: 1.323 (-9.61%) Host no hugepage guest hugepage: 1.371 (-13.5%) Host no hugepage guest no hugepage: 1.398 (-15.8%) I've no local EPT system to test so I may run them over vpn later on some large EPT system (and surely there are better benchs than a silly dd... but this is a start and shows even basic stuff gets the boost). The above is basically an "home-workstation/laptop" coverage. I (partly) intentionally run these on a system that has a ~$100 CPU and ~$50 motherboard, to show the absolute worst case, to be sure that 100% of home end users (running KVM) will take a measurable advantage from this effort. On huge systems the percentage boost is expected much bigger than on the home-workstation above test of course. -------------- Again gcc is a kind of worst case for it but it also shows a definitive significant and reproducible boost. 
Also note that for non-virtualization usage (so outside of
MADV_HUGEPAGE), invoking memory compaction synchronously is likely a risk
of losing CPU speed. khugepaged takes care of the long-lived allocations
of random tasks, and the only place to use memory compaction
synchronously could be the page faults of regions marked MADV_HUGEPAGE.
But we may decide to only invoke memory compaction asynchronously, and
never as a result of direct reclaim in process context, to avoid adding
any latency to guest operations. All that matters after boot is that
khugepaged can do its job; it's not urgent. When things are urgent,
migrating guests to a new cloud node is always possible.

I'd like to clarify that this whole work has been done without ever
making assumptions about virtual machines; I tried to make it as
universally useful as possible (and not just because we want the exact
same VM algorithms to trim one level of guest pagetables too and get a
cumulative boost, fully exploiting the KVM design ;). I'm thrilled Chris
is going to run a host-only database test and I'm surely willing to help
with that.

Compacting everything that is "movable" is surely solvable from a
theoretical standpoint, and that includes all anonymous memory (huge or
not) and all cache. That alone accounts for a huge bulk of the total
memory of a system, so being able to mix it all will result in the best
behavior, which isn't possible to achieve with hugetlbfs (memory that
isn't allocated as anonymous memory can still be used as cache for I/O).
So in the very worst case, if everything else fails on the producer front
(again: not the case as far as I can tell!), what should be reserved at
boot is just an amount of memory to confine the unmovable parts to,
leaving the movable parts free to be allocated dynamically without
limitations, depending on the workloads. I'm quite sure Mel will be able
to provide more details on his work, which has already been reviewed in
detail on linux-mm with lots of positive feedback, which is why I expect
zero problems on that side too in real life (besides my theoretical
standpoint in the previous paragraph ;).

Thanks,
Andrea