Hi,

Maybe you remember that I wrote a few weeks ago about a KVM CPU load
problem with hugepages. The thread was left hanging, but I now have
some new information. The description remains the same, except that I
have decreased both the guest memory and the number of hugepages:

Ram = 8GB, hugepages = 3546

Total of 2 virtual machines:
1. router with 32MB of RAM (hugepages) and 1 VCPU
2. linux guest with 3500MB of RAM (hugepages) and 4 VCPU

Everything works fine until I start the second linux guest with the
same 3500MB of guest RAM, also in hugepages and also with 4 VCPU. The
rest of the description is the same as before: after a while the host
shows a load average of about 8 (on a Core2Quad), and both big guests
seem to consume exactly the same amount of resources. The host seems
responsive though. Inside the guests, however, things are not so good:
the load skyrockets to at least 20. The guests are not responsive and
even a 'ps' executes unreasonably slowly (it may take a few minutes).
Inside the guests the load builds up gradually and the machine seems to
become slower with time, unlike the host, which shows the jump in
resource consumption instantly. It also seems that the more memory the
guests use, the faster the problem appears. Still, at least a gig of
RAM is free in each guest and there is no swap activity inside the
guest.

The most important thing, and the reason I went back and quoted an
older message rather than the last one, is that there is no more swap
activity on the host, so the previous line of thought may be wrong and
I am back at the beginning. There is plenty of RAM now, and swap usage
on the host is always 0 as seen in 'top'. And there is 100% CPU load,
shared equally between the two large guests.

To stop the load I can destroy either large guest. Additionally, I have
just discovered that suspending either large guest works as well.
Moreover, after resume, the load does not come back for a while. Both
methods stop the high load instantly (in less than a second).
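For reference, the hugepage budget implied by the numbers above can be
checked with a quick calculation. This is only a sketch: it assumes the
2MB huge page size from the earlier quoted message still applies, it
takes the host memory total from the host 'top' output in this message,
and the helper name pages_needed is made up for illustration.

```python
import math

HUGEPAGE_MB = 2  # assumption: 2MB huge pages, as stated in the earlier quoted message


def pages_needed(guest_ram_mb, page_mb=HUGEPAGE_MB):
    """Huge pages required to back one guest's RAM, rounded up (hypothetical helper)."""
    return math.ceil(guest_ram_mb / page_mb)


reserved = 3546                # huge pages reserved on the host
guests_mb = [32, 3500, 3500]   # router + two large linux guests
host_total_kb = 8193472        # "Mem: ... total" from 'top' on the host

needed = sum(pages_needed(mb) for mb in guests_mb)
spare = reserved - needed

# Memory left outside the huge page pool for the host itself.
host_rest_mb = host_total_kb / 1024 - reserved * HUGEPAGE_MB

print(f"needed={needed} reserved={reserved} spare={spare}")
print(f"left for host outside the pool: {host_rest_mb:.0f} MB")
```

If these inputs are right, the three guests need 3516 pages, so the
pool of 3546 covers them with 30 pages (60MB) to spare, and roughly
900MB remains for the host outside the pool.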

As you were asking for a 'top' inside the guest, here it is:

top - 03:27:27 up 42 min,  1 user,  load average: 18.37, 7.68, 3.12
Tasks: 197 total,  23 running, 174 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 89.2%sy,  0.0%ni, 10.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3510912k total,  1159760k used,  2351152k free,    62568k buffers
Swap:  4194296k total,        0k used,  4194296k free,   484492k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
12303 root      20   0     0    0    0 R  100  0.0   0:33.72 vpsnetclean
11772 99        20   0  149m  11m 2104 R   82  0.3   0:15.10 httpd
10906 99        20   0  149m  11m 2124 R   73  0.3   0:11.52 httpd
10247 99        20   0  149m  11m 2128 R   31  0.3   0:05.39 httpd
 3916 root      20   0 86468  11m 1476 R   16  0.3   0:15.14 cpsrvd-ssl
10919 99        20   0  149m  11m 2124 R    8  0.3   0:03.43 httpd
11296 99        20   0  149m  11m 2112 R    7  0.3   0:03.26 httpd
12265 99        20   0  149m  11m 2088 R    7  0.3   0:08.01 httpd
12317 root      20   0 99.6m 1384  716 R    7  0.0   0:06.57 crond
12326 503       20   0  8872   96   72 R    7  0.0   0:01.13 php
 3634 root      20   0 74804 1176  596 R    6  0.0   0:12.15 crond
11864 32005     20   0 87224  13m 2528 R    6  0.4   0:30.84 cpsrvd-ssl
12275 root      20   0 30628 9976 1364 R    6  0.3   0:24.68 cpgs_chk
11305 99        20   0  149m  11m 2104 R    6  0.3   0:02.53 httpd
12278 root      20   0  8808 1328  968 R    6  0.0   0:04.63 sim
 1534 root      20   0     0    0    0 S    6  0.0   0:03.29 flush-254:2
 3626 root      20   0  149m  13m 5324 R    6  0.4   0:27.62 httpd
12279 32008     20   0 87472 7668 2480 R    6  0.2   0:27.63 munin-update
10243 99        20   0  149m  11m 2128 R    5  0.3   0:08.47 httpd
12321 root      20   0 99.6m 1460  792 R    5  0.0   0:07.43 crond
12325 root      20   0 74804  672   92 R    5  0.0   0:00.76 crond
 1531 root      20   0     0    0    0 S    2  0.0   0:02.26 kjournald
    1 root      20   0 10316  756  620 S    0  0.0   0:02.10 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.01 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:01.08 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   0:00.47 migration/1
    7 root      20   0     0    0    0 S    0  0.0   0:00.03 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1

The tasks are changing in the 'top' view, so it
is nothing like a single task hanging - it is more like a machine that
is swapping heavily. The problem is, however, that according to vmstat
there is no swap activity during this time.

Should I try to decrease the RAM I give to my guests even more? Is it
too much to have 3 guests with hugepages? Should I try something else?
Unfortunately it is a production system and I can't play with it very
much.

Here is 'top' on the host:

top - 03:32:12 up 25 days, 23:38,  2 users,  load average: 8.50, 5.07, 10.39
Tasks: 133 total,   1 running, 132 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.1%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   8193472k total,  8071776k used,   121696k free,    45296k buffers
Swap: 11716412k total,        0k used, 11714844k free,   197236k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 8426 libvirt-  20   0 3771m  27m 3904 S  199  0.3  10:28.33 kvm
 8374 libvirt-  20   0 3815m  32m 3908 S  199  0.4   8:11.53 kvm
 1557 libvirt-  20   0  225m 7720 2092 S    1  0.1 436:54.45 kvm
   72 root      20   0     0    0    0 S    0  0.0   6:22.54 kondemand/3
  379 root      20   0     0    0    0 S    0  0.0  58:20.99 md3_raid5
    1 root      20   0 23768 1944 1228 S    0  0.0   0:00.95 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.24 kthreadd
    3 root      20   0     0    0    0 S    0  0.0   0:12.66 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:07.58 migration/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   0:15.05 migration/1
    7 root      20   0     0    0    0 S    0  0.0   0:19.64 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S    0  0.0   0:07.21 migration/2
   10 root      20   0     0    0    0 S    0  0.0   0:41.74 ksoftirqd/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S    0  0.0   0:13.62 migration/3
   13 root      20   0     0    0    0 S    0  0.0   0:24.63 ksoftirqd/3
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      20   0     0    0    0 S    0  0.0   1:17.11 events/0
   16 root      20   0     0    0    0 S    0  0.0   1:33.30 events/1
   17 root      20   0     0    0    0 S    0  0.0   4:15.28 events/2
   18 root      20   0     0    0    0 S    0  0.0   1:13.49 events/3
   19 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   20 root      20   0     0    0    0 S    0  0.0   0:00.02 khelper
   21 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   22 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   23 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   25 root      20   0     0    0    0 S    0  0.0   0:02.47 sync_supers
   26 root      20   0     0    0    0 S    0  0.0   0:03.86 bdi-default

Please help...

Thanks,
Dmitry

On Sat, Oct 2, 2010 at 1:30 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>
> On Thu, Sep 30, 2010 at 12:07:15PM +0300, Dmitry Golubev wrote:
> > Hi,
> >
> > I am not sure what's really happening, but every few hours
> > (unpredictable) two virtual machines (Linux 2.6.32) start to generate
> > huge cpu loads. It looks like some kind of loop is unable to complete
> > or something...
> >
> > So the idea is:
> >
> > 1. I have two linux 2.6.32 x64 (openvz, proxmox project) guests
> > running on linux 2.6.35 x64 (ubuntu maverick) host with a Q6600
> > Core2Quad on qemu-kvm 0.12.5 and libvirt 0.8.3 and another one small
> > 32bit linux virtual machine (16MB of ram) with a router inside (i
> > doubt it contributes to the problem).
> >
> > 2. All these machines use hugetlbfs. The server has 8GB of RAM, I
> > reserved 3696 huge pages (page size is 2MB) on the server, and I am
> > running the main guests each having 3550MB of virtual memory. The
> > third guest, as I wrote before, takes 16MB of virtual memory.
> >
> > 3. Once run, the guests reserve huge pages for themselves normally. As
> > mem-prealloc is default, they grab all the memory they should have,
> > leaving 6 pages unreserved (HugePages_Free - HugePages_Rsvd = 6) all
> > times - so as I understand they should not want to get any more,
> > right?
> >
> > 4. All virtual machines run perfectly normal without any disturbances
> > for few hours. They do not, however, use all their memory, so maybe
> > the issue arises when they pass some kind of a threshold.
> >
> > 5. At some point of time both guests exhibit cpu load over the top
> > (16-24). At the same time, host works perfectly well, showing load of
> > 8 and that both kvm processes use CPU equally and fully.
> > This point of time is unpredictable - it can be anything from one to
> > twenty hours, but it will be less than a day. Sometimes the load
> > disappears in a moment, but usually it stays like that, and
> > everything works extremely slow (even a 'ps' command executes some
> > 2-5 minutes).
> >
> > 6. If I am patient, I can start rebooting the guest systems - once
> > they have restarted, everything returns to normal. If I destroy one of
> > the guests (virsh destroy), the other one starts working normally at
> > once (!).
> >
> > I am relatively new to kvm and I am absolutely lost here. I have not
> > experienced such problems before, but recently I upgraded from ubuntu
> > lucid (I think it was linux 2.6.32, qemu-kvm 0.12.3 and libvirt 0.7.5)
> > and started to use hugepages. These two virtual machines are not
> > normally run on the same host system (i have a corosync/pacemaker
> > cluster with drbd storage), but when one of the hosts is not
> > available, they start running on the same host. That is the reason I
> > have not noticed this earlier.
> >
> > Unfortunately, I don't have any spare hardware to experiment and this
> > is a production system, so my debugging options are rather limited.
> >
> > Do you have any ideas, what could be wrong?
>
> Is there swapping activity on the host when this happens?
>