Re: Nested KVM issue

Boris Derzhavets <bderzhavets@xxxxxxxxxxx> · Wed, 17 Aug 2016 12:10:02 +0000

For me, KSM is an unpredictable feature. The problem is on Compute; only this node
does "copy on write", so only on Compute.
My concern is exactly whether it would lead to worse or better guest behavior.
I am not expecting a complete fix. I would track it with top/htop and dmesg from cron at
1-2 hour intervals.
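
The periodic tracking suggested above could be sketched as a small shell function run from cron. The function name `snapshot` and the default log path are placeholders invented for this example, not something used in the thread.

```shell
# Append one snapshot of CPU/memory state and recent kernel messages to a log.
# The default log path is a placeholder chosen for this sketch.
snapshot() {
    log="${1:-/var/log/kvm-snapshot.log}"
    {
        printf '=== %s ===\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')"
        # Batch mode, one iteration: summary lines plus the busiest tasks.
        top -b -n 1 2>/dev/null | head -n 20
        # Most recent kernel messages; may need root if dmesg is restricted.
        dmesg 2>/dev/null | tail -n 30
    } >> "$log"
}
```

Saved in a script that ends with a call to `snapshot`, it could be scheduled with a crontab entry such as `0 */2 * * * /usr/local/sbin/kvm-snapshot.sh` (script path hypothetical) to collect one sample every two hours.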

From: centos-virt-bounces@xxxxxxxxxx <centos-virt-bounces@xxxxxxxxxx> on behalf of Laurentiu Soica <laurentiu@xxxxxxxx>

Sent: Wednesday, August 17, 2016 6:38 AM

To: Discussion about the virtualization on CentOS

Subject: Re:  Nested KVM issue

On both baremetal and compute? Are there any other metrics you consider useful to collect for troubleshooting purposes?

On Wed, 17 Aug 2016 at 13:04, Boris Derzhavets <bderzhavets@xxxxxxxxxxx> wrote:

It sounds weird, but try disabling KSM and see whether it helps.
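
A minimal sketch of what disabling KSM might look like, assuming the stock CentOS 7 `ksm` and `ksmtuned` services and the kernel's sysfs interface under /sys/kernel/mm/ksm. The function name `ksm_unmerge` is made up for this example, and its optional directory argument exists only so the function can be tried against a scratch directory.

```shell
# Stop the tuning daemon first, otherwise it may switch KSM back on:
#   systemctl stop ksmtuned ksm
#   systemctl disable ksmtuned ksm

ksm_unmerge() {
    dir="${1:-/sys/kernel/mm/ksm}"
    # run: 0 = stop ksmd but keep merged pages,
    #      1 = run ksmd,
    #      2 = stop ksmd and unmerge all currently merged pages.
    echo 2 > "$dir/run"
    # Report the resulting state.
    printf 'run=%s pages_shared=%s\n' "$(cat "$dir/run")" "$(cat "$dir/pages_shared")"
}
```

Run as root with no argument, `ksm_unmerge` acts on the real sysfs directory; `pages_shared` should fall to 0 once unmerging has finished.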

From: centos-virt-bounces@xxxxxxxxxx <centos-virt-bounces@xxxxxxxxxx> on behalf of Laurentiu Soica <laurentiu@xxxxxxxx>

Sent: Wednesday, August 17, 2016 4:56 AM

To: Discussion about the virtualization on CentOS

Subject: Re:  Nested KVM issue

Enabled the logging on both compute and baremetal. Nothing strange in the logs:

on baremetal:

Wed Aug 17 11:51:01 EEST 2016: committed 62310764 free 58501808
Wed Aug 17 11:51:01 EEST 2016: 87025667 < 123574516 and free > 24714903, stop ksm

on compute:

Wed Aug 17 08:52:52 UTC 2016: committed 24547132 free 76730936
Wed Aug 17 08:52:52 UTC 2016: 45139624 < 102962460 and free > 20592492, stop ksm

and the compute node is again at 100% CPU utilization.
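
The logged numbers can be reproduced by hand, which helps confirm that ksmtuned itself is deciding to stop KSM on both hosts. The sketch below assumes the rule is "stop when committed + threshold < total and free > threshold", with the threshold being 20 percent of total memory; that coefficient is an assumption that happens to match the logged values (all in kB). `ksm_decision` is a name invented here, and the "start ksm" branch is a simplification of what the daemon really does.

```shell
# ksm_decision COMMITTED_KB FREE_KB TOTAL_KB [COEF_PERCENT]
ksm_decision() {
    committed=$1
    free=$2
    total=$3
    coef=${4:-20}                       # assumed threshold coefficient, in percent
    thres=$(( total * coef / 100 ))     # free-memory threshold in kB
    if [ $(( committed + thres )) -lt "$total" ] && [ "$free" -gt "$thres" ]; then
        echo "$(( committed + thres )) < $total and free > $thres, stop ksm"
    else
        echo "start ksm"
    fi
}

# Values taken from the two log excerpts above.
ksm_decision 62310764 58501808 123574516    # baremetal
ksm_decision 24547132 76730936 102962460    # compute
```

Both calls print the same "stop ksm" lines as the logs, so on these figures neither host is under enough memory pressure for ksmtuned to keep KSM running.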

On Tue, 16 Aug 2016 at 15:26, Boris Derzhavets <bderzhavets@xxxxxxxxxxx> wrote:

I would enable ksmtuned logging; if that has already been done, check the logs.
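
A sketch of enabling ksmtuned debug logging, assuming /etc/ksmtuned.conf ships with commented-out `LOGFILE` and `DEBUG` lines as in the stock CentOS 7 package. `enable_ksmtuned_debug` is a hypothetical helper name, and `sed -i` is used in its GNU form.

```shell
# Uncomment the LOGFILE= and DEBUG= lines in the ksmtuned configuration file.
enable_ksmtuned_debug() {
    conf="${1:-/etc/ksmtuned.conf}"
    sed -i \
        -e 's/^#[[:space:]]*\(LOGFILE=\)/\1/' \
        -e 's/^#[[:space:]]*\(DEBUG=\)/\1/' \
        "$conf"
}
```

After running it as root, `systemctl restart ksmtuned` makes the daemon pick up the change, and entries then appear in the file named by `LOGFILE`.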

From: centos-virt-bounces@xxxxxxxxxx <centos-virt-bounces@xxxxxxxxxx> on behalf of Laurentiu Soica <laurentiu@xxxxxxxx>

Sent: Tuesday, August 16, 2016 7:16 AM

To: Discussion about the virtualization on CentOS

Subject: Re:  Nested KVM issue

Yes. It is on both baremetal and compute node.

On Tue, 16 Aug 2016 at 13:37, Boris Derzhavets <bderzhavets@xxxxxxxxxxx> wrote:

Is KSM enabled on your Compute Nodes (presuming CentOS 7.2 on bare metal)?

From: centos-virt-bounces@xxxxxxxxxx <centos-virt-bounces@xxxxxxxxxx> on behalf of Laurentiu Soica <laurentiu@xxxxxxxx>

Sent: Tuesday, August 16, 2016 5:25 AM

To: Discussion about the virtualization on CentOS

Subject: Re:  Nested KVM issue

Running the compute node for several days simply triggers it.

On Tue, 16 Aug 2016 at 12:12, Boris Derzhavets <bderzhavets@xxxxxxxxxxx> wrote:

Sorry,
How do you trigger the problem?
B.

From: centos-virt-bounces@xxxxxxxxxx <centos-virt-bounces@xxxxxxxxxx> on behalf of Laurentiu Soica <laurentiu@xxxxxxxx>

Sent: Tuesday, August 16, 2016 3:28 AM

To: Discussion about the virtualization on CentOS

Subject: Re:  Nested KVM issue

Hello,

The issue reproduced again and it doesn't look like a swap problem. Some details:

on the baremetal, from top:

top - 08:08:52 up 5 days, 16:43,  3 users,  load average: 36.19, 36.05, 36.05
Tasks: 493 total,   1 running, 492 sleeping,   0 stopped,   0 zombie

%Cpu(s):  3.5 us, 87.9 sy,  0.0 ni,  8.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 12357451+total, 14296000 free, 65634428 used, 43644088 buff/cache
KiB Swap:  4194300 total,  4073868 free,   120432 used. 56953888 avail Mem

19158 qemu      20   0  0.098t 0.041t  10476 S  3650 35.6  13048:24 qemu-kvm

The compute node has 36 CPUs and the usage is now 100%. There are more than 50 GB of memory still available on the baremetal. The swap is barely used, 120 MB.

On compute node, from top:

top - 05:11:58 up 1 day, 15:08,  2 users,  load average: 40.46, 40.49, 40.74

%Cpu(s): 99.1 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.1 st
KiB Mem : 10296246+total, 78079936 free, 23671360 used,  1211160 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 78939968 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6032 qemu      20   0 10.601g 1.272g  12964 S 400.0  1.3 588:40.39 qemu-kvm
 5673 qemu      20   0 10.602g 1.006g  13020 S 399.7  1.0   1161:47 qemu-kvm
 5998 qemu      20   0 10.601g 1.192g  13028 S 367.9  1.2   1544:30 qemu-kvm
 5951 qemu      20   0 10.601g 1.246g  13020 S 348.3  1.3   1547:38 qemu-kvm
 5750 qemu      20   0 10.599g 990136  13060 S 339.1  1.0   1152:25 qemu-kvm
 5752 qemu      20   0 10.598g 1.426g  13040 S 313.9  1.5 663:13.65 qemu-kvm
....

There are more than 70 GB of memory available on the compute node. All VMs are using 100% of their CPUs and they are not accessible anymore.

Laurentiu 

On Sun, 14 Aug 2016 at 21:44, Boris Derzhavets <bderzhavets@xxxxxxxxxxx> wrote:

From: centos-virt-bounces@xxxxxxxxxx <centos-virt-bounces@xxxxxxxxxx> on behalf of Laurentiu Soica <laurentiu@xxxxxxxx>

Sent: Sunday, August 14, 2016 10:17 AM

To: Discussion about the virtualization on CentOS

Subject: Re:  Nested KVM issue 

More details on the subject: 

I suppose it is a nested KVM issue because it arose after I enabled the nested KVM feature. Without that feature, however, the second-level VMs are unusable in terms of performance.

I am using CentOS 7 with:

kernel: 3.10.0-327.22.2.el7.x86_64
qemu-kvm: 1.5.3-105.el7_2.4
libvirt: 1.2.17-13.el7_2.5

on both the baremetal and the compute VM.
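
To confirm at each level whether nested virtualization is actually switched on, the KVM module parameter can be read from sysfs. This is only a sketch: `nested_kvm_status` is an invented name, and the optional base-directory argument exists purely so the function can be tried outside a real host.

```shell
# Print the "nested" parameter of whichever KVM vendor module is loaded.
nested_kvm_status() {
    base="${1:-/sys/module}"
    for mod in kvm_intel kvm_amd; do
        f="$base/$mod/parameters/nested"
        if [ -r "$f" ]; then
            # Y or 1 means nested virtualization is enabled for this module.
            echo "$mod nested=$(cat "$f")"
        fi
    done
}
```

Running `nested_kvm_status` on the baremetal host shows whether it can expose virtualization extensions to the compute VM; no output means neither module is loaded.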

Please post:

1) # virsh dumpxml  VM-L1  (VM-L1 being the L1 guest in which you expect nested KVM to appear)

2) Log in to VM-L1 and run:

    # lsmod | grep kvm

3) I need the outputs from VM-L1 (in case it is the Compute Node):

# cat /etc/nova/nova.conf | grep virt_type

# cat /etc/nova/nova.conf | grep  cpu_mode

Boris.

The only workaround for now is to shut down the compute VM and start it again from the baremetal with virsh start.
A simple restart of the compute node doesn't help. It looks like the qemu-kvm process corresponding to the compute VM is the problem.

Laurentiu

On Sun, 14 Aug 2016 at 00:19, Laurentiu Soica <laurentiu@xxxxxxxx> wrote:

Hello, 

I have an OpenStack setup in virtual environment on CentOS 7.

The baremetal has nested KVM enabled and 1 compute node as a VM.

Inside the compute node I have multiple VMs running.

About every 3 days the VMs become inaccessible and the compute node reports high CPU usage. The qemu-kvm process for each VM inside the compute node reports full CPU usage.

Please help me with some hints to debug this issue.

Thanks,
Laurentiu

_______________________________________________

CentOS-virt mailing list

CentOS-virt@xxxxxxxxxx

https://lists.centos.org/mailman/listinfo/centos-virt
