Hi,

Sorry to bother you again. I have more info:

> 1. router with 32MB of RAM (hugepages) and 1 VCPU
...
> Is it too much to have 3 guests with hugepages?

OK, this router is also out of the equation - I have disabled hugepages
for it. That should also leave a few additional pages available to the
guests.

I think this should be pretty reproducible... Two identical 64bit Linux
2.6.32 guests with 3500MB of virtual RAM and 4 VCPUs each, running on a
Core2Quad (4 real cores) machine with 8GB of RAM and 3546 2MB hugepages,
on a 64bit Linux 2.6.35 host (libvirt 0.8.3) from Ubuntu Maverick. Still
no swapping, and the effect is pretty much the same: one guest runs
well; with two guests everything works for a few minutes and then slows
down a few hundred times, showing huge load both inside (unbounded,
rapid growth of the load average) and outside (the host load does not
make it unresponsive, but it is loaded to the max). The load growth on
the host is instant and bounded (the jump in the 'r' column shows the
sudden rise):

# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 1  3      0 194220  30680  76712    0    0   319    28  2633  1960  6  6 67 20
 1  2      0 193776  30680  76712    0    0     4   231 55081 78491  3 39 17 41
10  1      0 185508  30680  76712    0    0     4    87 53042 34212 55 27  9  9
12  0      0 185180  30680  76712    0    0     2    95 41007 21990 84 16  0  0

Thanks,
Dmitry

On Wed, Nov 17, 2010 at 4:19 AM, Dmitry Golubev <lastguru@xxxxxxxxx> wrote:
> Hi,
>
> Maybe you remember that I wrote a few weeks ago about a KVM CPU load
> problem with hugepages. The problem was left hanging, but I now have
> some new information. The description remains the same, but I have
> decreased both the guest memory and the number of hugepages:
>
> RAM = 8GB, hugepages = 3546
>
> Total of 2 virtual machines:
> 1. router with 32MB of RAM (hugepages) and 1 VCPU
> 2. linux guest with 3500MB of RAM (hugepages) and 4 VCPUs
>
> Everything works fine until I start the second linux guest with the
> same 3500MB of guest RAM, also in hugepages and also with 4 VCPUs. The
> rest of the description is the same as before: after a while the host
> shows a load average of about 8 (on a Core2Quad), and both big guests
> seem to consume exactly the same amount of resources. The host stays
> responsive, though. Inside the guests, however, things are not so good
> - the load skyrockets to at least 20. The guests are unresponsive, and
> even a 'ps' executes inappropriately slowly (it may take a few
> minutes; inside the guest the load builds up and the machine seems to
> become slower over time, unlike the host, which shows the jump in
> resource consumption instantly). It also seems that the more memory
> the guests use, the faster the problem appears. Still, at least a gig
> of RAM is free in each guest and there is no swap activity inside the
> guests.
>
> The most important thing - and the reason I went back and quoted an
> older message rather than the last one - is that there is no more swap
> activity on the host, so the previous train of thought may also be
> wrong and I have returned to the beginning. There is plenty of RAM now
> and swap usage on the host is always 0, as seen in 'top'. And there is
> 100% CPU load, shared equally between the two large guests. To stop
> the load I can destroy either large guest. Additionally, I have just
> discovered that suspending either large guest works as well. Moreover,
> after a resume, the load does not come back for a while. Both methods
> stop the high load instantly (in under a second).
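>
> (For reference, these are just the plain libvirt commands I use for
> that - 'guest1' and 'guest2' here only stand in for my two large
> domains:
>
> virsh suspend guest1   # high load stops within a second
> virsh resume guest1    # after this the load stays away for a while
> virsh destroy guest2   # destroying either large guest also stops it
> )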
>
> As you were asking for a 'top' inside the guest, here it is:
>
> top - 03:27:27 up 42 min,  1 user,  load average: 18.37, 7.68, 3.12
> Tasks: 197 total,  23 running, 174 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us, 89.2%sy,  0.0%ni, 10.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:   3510912k total,  1159760k used,  2351152k free,    62568k buffers
> Swap:  4194296k total,        0k used,  4194296k free,   484492k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 12303 root      20   0     0    0    0 R  100  0.0   0:33.72 vpsnetclean
> 11772 99        20   0  149m  11m 2104 R   82  0.3   0:15.10 httpd
> 10906 99        20   0  149m  11m 2124 R   73  0.3   0:11.52 httpd
> 10247 99        20   0  149m  11m 2128 R   31  0.3   0:05.39 httpd
>  3916 root      20   0 86468  11m 1476 R   16  0.3   0:15.14 cpsrvd-ssl
> 10919 99        20   0  149m  11m 2124 R    8  0.3   0:03.43 httpd
> 11296 99        20   0  149m  11m 2112 R    7  0.3   0:03.26 httpd
> 12265 99        20   0  149m  11m 2088 R    7  0.3   0:08.01 httpd
> 12317 root      20   0 99.6m 1384  716 R    7  0.0   0:06.57 crond
> 12326 503       20   0  8872   96   72 R    7  0.0   0:01.13 php
>  3634 root      20   0 74804 1176  596 R    6  0.0   0:12.15 crond
> 11864 32005     20   0 87224  13m 2528 R    6  0.4   0:30.84 cpsrvd-ssl
> 12275 root      20   0 30628 9976 1364 R    6  0.3   0:24.68 cpgs_chk
> 11305 99        20   0  149m  11m 2104 R    6  0.3   0:02.53 httpd
> 12278 root      20   0  8808 1328  968 R    6  0.0   0:04.63 sim
>  1534 root      20   0     0    0    0 S    6  0.0   0:03.29 flush-254:2
>  3626 root      20   0  149m  13m 5324 R    6  0.4   0:27.62 httpd
> 12279 32008     20   0 87472 7668 2480 R    6  0.2   0:27.63 munin-update
> 10243 99        20   0  149m  11m 2128 R    5  0.3   0:08.47 httpd
> 12321 root      20   0 99.6m 1460  792 R    5  0.0   0:07.43 crond
> 12325 root      20   0 74804  672   92 R    5  0.0   0:00.76 crond
>  1531 root      20   0     0    0    0 S    2  0.0   0:02.26 kjournald
>     1 root      20   0 10316  756  620 S    0  0.0   0:02.10 init
>     2 root      20   0     0    0    0 S    0  0.0   0:00.01 kthreadd
>     3 root      RT   0     0    0    0 S    0  0.0   0:01.08 migration/0
>     4 root      20   0     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
>     5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
>     6 root      RT   0     0    0    0 S    0  0.0   0:00.47 migration/1
>     7 root      20   0     0    0    0 S    0  0.0   0:00.03 ksoftirqd/1
>     8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
>
> The tasks keep changing in the 'top' view, so this is nothing like a
> single task hanging - it looks more like a machine working off swap.
> The problem, however, is that according to vmstat there is no swap
> activity during this time. Should I try to decrease the RAM I give to
> my guests even more? Is it too much to have 3 guests with hugepages?
> Should I try something else? Unfortunately it is a production system
> and I can't play with it very much.
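>
> For completeness, the hugepage setup itself is nothing special. On the
> host it boils down to roughly the following (the mount point below is
> only an example - libvirt's qemu.conf points at whatever hugetlbfs
> mount is configured via hugetlbfs_mount), plus
> <memoryBacking><hugepages/></memoryBacking> in each big guest's domain
> XML:
>
> echo 3546 > /proc/sys/vm/nr_hugepages        # reserve the 2MB pages early after boot
> mount -t hugetlbfs hugetlbfs /dev/hugepages  # backing store qemu-kvm maps via -mem-path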
>
> Here is 'top' on the host:
>
> top - 03:32:12 up 25 days, 23:38,  2 users,  load average: 8.50, 5.07, 10.39
> Tasks: 133 total,   1 running, 132 sleeping,   0 stopped,   0 zombie
> Cpu(s): 99.1%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:   8193472k total,  8071776k used,   121696k free,    45296k buffers
> Swap: 11716412k total,        0k used, 11714844k free,   197236k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  8426 libvirt-  20   0 3771m  27m 3904 S  199  0.3  10:28.33 kvm
>  8374 libvirt-  20   0 3815m  32m 3908 S  199  0.4   8:11.53 kvm
>  1557 libvirt-  20   0  225m 7720 2092 S    1  0.1 436:54.45 kvm
>    72 root      20   0     0    0    0 S    0  0.0   6:22.54 kondemand/3
>   379 root      20   0     0    0    0 S    0  0.0  58:20.99 md3_raid5
>     1 root      20   0 23768 1944 1228 S    0  0.0   0:00.95 init
>     2 root      20   0     0    0    0 S    0  0.0   0:00.24 kthreadd
>     3 root      20   0     0    0    0 S    0  0.0   0:12.66 ksoftirqd/0
>     4 root      RT   0     0    0    0 S    0  0.0   0:07.58 migration/0
>     5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
>     6 root      RT   0     0    0    0 S    0  0.0   0:15.05 migration/1
>     7 root      20   0     0    0    0 S    0  0.0   0:19.64 ksoftirqd/1
>     8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
>     9 root      RT   0     0    0    0 S    0  0.0   0:07.21 migration/2
>    10 root      20   0     0    0    0 S    0  0.0   0:41.74 ksoftirqd/2
>    11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
>    12 root      RT   0     0    0    0 S    0  0.0   0:13.62 migration/3
>    13 root      20   0     0    0    0 S    0  0.0   0:24.63 ksoftirqd/3
>    14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
>    15 root      20   0     0    0    0 S    0  0.0   1:17.11 events/0
>    16 root      20   0     0    0    0 S    0  0.0   1:33.30 events/1
>    17 root      20   0     0    0    0 S    0  0.0   4:15.28 events/2
>    18 root      20   0     0    0    0 S    0  0.0   1:13.49 events/3
>    19 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
>    20 root      20   0     0    0    0 S    0  0.0   0:00.02 khelper
>    21 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
>    22 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
>    23 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
>    25 root      20   0     0    0    0 S    0  0.0   0:02.47 sync_supers
>    26 root      20   0     0    0    0 S    0  0.0   0:03.86 bdi-default
>
> Please help...
>
> Thanks,
> Dmitry
>
> On Sat, Oct 2, 2010 at 1:30 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>>
>> On Thu, Sep 30, 2010 at 12:07:15PM +0300, Dmitry Golubev wrote:
>> > Hi,
>> >
>> > I am not sure what is really happening, but every few hours
>> > (unpredictably) two virtual machines (Linux 2.6.32) start to
>> > generate huge CPU loads. It looks like some kind of loop is unable
>> > to complete, or something...
>> >
>> > So the idea is:
>> >
>> > 1. I have two Linux 2.6.32 x64 (openvz, proxmox project) guests
>> > running on a Linux 2.6.35 x64 (Ubuntu Maverick) host with a Q6600
>> > Core2Quad, on qemu-kvm 0.12.5 and libvirt 0.8.3, plus one more small
>> > 32bit Linux virtual machine (16MB of RAM) with a router inside (I
>> > doubt it contributes to the problem).
>> >
>> > 2. All these machines use hugetlbfs. The server has 8GB of RAM, I
>> > reserved 3696 huge pages (page size is 2MB) on the server, and I am
>> > running the main guests with 3550MB of virtual memory each. The
>> > third guest, as I wrote before, takes 16MB of virtual memory.
>> >
>> > 3. Once started, the guests reserve huge pages for themselves
>> > normally. As mem-prealloc is the default, they grab all the memory
>> > they should have, leaving 6 pages unreserved (HugePages_Free -
>> > HugePages_Rsvd = 6) at all times - so as I understand it, they
>> > should not want to grab any more, right?
>> >
>> > 4. All virtual machines run perfectly normally, without any
>> > disturbances, for a few hours. They do not, however, use all of
>> > their memory, so maybe the issue arises when they pass some kind of
>> > threshold.
>> >
>> > 5. At some point in time both guests exhibit CPU load through the
>> > roof (16-24). At the same time the host works perfectly well,
>> > showing a load of 8, with both kvm processes using the CPU equally
>> > and fully. This point in time is unpredictable - it can be anything
>> > from one to twenty hours, but it will be less than a day. Sometimes
>> > the load disappears in a moment, but usually it stays like that,
>> > and everything works extremely slowly (even a 'ps' command takes
>> > some 2-5 minutes).
>> >
>> > 6. If I am patient, I can start rebooting the guest systems - once
>> > they have restarted, everything returns to normal. If I destroy one
>> > of the guests (virsh destroy), the other one starts working
>> > normally at once (!).
>> >
>> > I am relatively new to KVM and I am absolutely lost here. I have
>> > not experienced such problems before, but recently I upgraded from
>> > Ubuntu Lucid (I think it was Linux 2.6.32, qemu-kvm 0.12.3 and
>> > libvirt 0.7.5) and started to use hugepages. These two virtual
>> > machines are not normally run on the same host system (I have a
>> > corosync/pacemaker cluster with DRBD storage), but when one of the
>> > hosts is not available, they end up running on the same host. That
>> > is why I had not noticed this earlier.
>> >
>> > Unfortunately, I don't have any spare hardware to experiment with,
>> > and this is a production system, so my debugging options are rather
>> > limited.
>> >
>> > Do you have any ideas what could be wrong?
>>
>> Is there swapping activity on the host when this happens?
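>
> As mentioned above - no swapping on the host that I can see. For what
> it is worth, this is roughly what I watch while it happens; the si/so
> columns stay at zero and the hugepage counters do not move:
>
> vmstat 5                            # si/so (swap-in/out) stay at 0
> grep -E 'Swap|Huge' /proc/meminfo   # swap usage and HugePages_* accounting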