Re: Shutting down a VM with Kernel 4.14 will sometimes hang and a reboot is the only way to recover.

On 2017-11-27 02:38 PM, David Hill wrote:


On 2017-11-26 10:44 PM, Jason Wang wrote:


On 2017-11-25 00:22, David Hill wrote:
The VMs all have two vNICs, and this is the hypervisor's network configuration:

[root@zappa ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
virbr0          8000.525400914858       yes             virbr0-nic
                                                        vnet0
                                                        vnet1


1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 84:2b:2b:13:f2:91 brd ff:ff:ff:ff:ff:ff
    inet redacted/24 brd 173.178.138.255 scope global dynamic eno1
       valid_lft 48749sec preferred_lft 48749sec
    inet6 fe80::862b:2bff:fe13:f291/64 scope link
       valid_lft forever preferred_lft forever
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 84:2b:2b:13:f2:92 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.3/24 brd 192.168.1.255 scope global eno2
       valid_lft forever preferred_lft forever
    inet6 fe80::862b:2bff:fe13:f292/64 scope link
       valid_lft forever preferred_lft forever
4: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:91:48:58 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.10/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.11/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.12/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.15/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.16/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.17/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.18/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.31/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.32/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.33/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.34/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.35/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.36/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.37/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.45/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.46/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.47/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.48/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.49/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.50/32 scope global virbr0
       valid_lft forever preferred_lft forever
    inet 192.168.122.51/32 scope global virbr0
       valid_lft forever preferred_lft forever
5: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master virbr0 state DOWN group default qlen 1000
    link/ether 52:54:00:91:48:58 brd ff:ff:ff:ff:ff:ff
125: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1360 qdisc fq_codel state UNKNOWN group default qlen 100
    link/none
    inet 10.10.122.28/21 brd 10.10.127.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::1f9b:bfd4:e9c9:2059/64 scope link stable-privacy
       valid_lft forever preferred_lft forever
402: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master virbr0 state UNKNOWN group default qlen 1000
    link/ether fe:54:00:09:27:39 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe09:2739/64 scope link
       valid_lft forever preferred_lft forever
403: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master virbr0 state UNKNOWN group default qlen 1000
    link/ether fe:54:00:ea:6b:18 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:feea:6b18/64 scope link
       valid_lft forever preferred_lft forever


I could not reproduce this locally by simply running netperf through an mlx4 card. Some more questions:

- What kind of workload did you run in the guest?
- Did you hit this issue with a specific type of network card (I guess Broadcom is used in this case)?
- virbr0 looks like a bridge created by libvirt that does NAT and other things; can you still hit this issue if you don't use virbr0?

More importantly, zerocopy is known to have issues; in a production environment, it should be disabled through the vhost_net module parameters.

Thanks

I'm deploying an overcloud through an undercloud virtual machine. The VM has 4 vCPUs and 16 GB of RAM as well as two virtio NICs, so I'm using only virtual hardware here. I spawn 7 VMs on the hypervisor and deploy an overcloud on them using TripleO. Everything is virtual, and if I remove the bridge, I'll have to configure each VM differently.

The load is quite high on the VM that won't shut down, but when I shut it down, it's doing nothing.

This is a hard bug to troubleshoot, and I can't bisect the kernel because at some point the system simply won't boot properly.

I've disabled zerocopy with the following:

[root@zappa modprobe.d]# cat vhost-net.conf
options vhost_net  experimental_zcopytx=0
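A modprobe.d option only takes effect the next time vhost_net is loaded, so it may be worth confirming the value the running module actually uses. The sketch below assumes the parameter is exposed read-only at the usual sysfs location, /sys/module/vhost_net/parameters/experimental_zcopytx; the function name zcopy_state is hypothetical, and a different file can be passed as the first argument for testing.

```shell
#!/bin/sh
# Hypothetical helper: report whether vhost_net zerocopy TX is active.
# ASSUMPTION: the loaded module exposes its parameter at the sysfs path
# below; pass another file as $1 to override it (e.g. for testing).
zcopy_state() {
    param=${1:-/sys/module/vhost_net/parameters/experimental_zcopytx}
    if [ ! -r "$param" ]; then
        # File missing or unreadable: most likely vhost_net is not loaded.
        echo "unknown (vhost_net not loaded?)"
        return 1
    fi
    value=$(cat "$param")   # command substitution strips the newline
    if [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled ($value)"
    fi
}
```

Running guests keep /dev/vhost-net open, so reloading the module (modprobe -r vhost_net, then modprobe vhost_net) generally requires shutting the VMs down first; rebooting the hypervisor also picks up the new option.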


And I haven't reproduced this issue so far. The problem I have right now is that experimental_zcopytx has been enabled by default since this commit:

commit f9611c43ab0ddaf547b395c90fb842f55959334c
Author: Michael S. Tsirkin <mst@xxxxxxxxxx>
Date:   Thu Dec 6 14:56:00 2012 +0200

    vhost-net: enable zerocopy tx by default

    Zero copy TX has been around for a while now.
    We seem to be down to eliminating theoretical bugs
    and performance tuning at this point:
    it's probably time to enable it by default so that
    most users get the benefit.

    Keep the flag around meanwhile so users can experiment
    with disabling this if they experience regressions.
    I expect that we will remove it in the future.

    Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
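Per the commit above, zerocopy TX is on unless a host explicitly turns it off, so a host without an options line like the one in vhost-net.conf runs with it enabled. As a minimal sketch, the hypothetical configured_zcopy function reads modprobe configuration on stdin and prints the experimental_zcopytx value set for vhost_net, or "unset". It assumes the configuration uses the "options <module> <param>=<value>" line format shown earlier and that the last value found is the one that matters.

```shell
#!/bin/sh
# Hypothetical helper: print the experimental_zcopytx value configured for
# vhost_net in modprobe configuration read from stdin, or "unset".
# ASSUMPTION: "options <module> key=value ..." lines; the last match wins.
configured_zcopy() {
    awk '
        $1 == "options" && ($2 == "vhost_net" || $2 == "vhost-net") {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^experimental_zcopytx=/) {
                    sub(/^experimental_zcopytx=/, "", $i)
                    val = $i
                }
        }
        END { print (val == "" ? "unset" : val) }
    '
}
# Possible use on a live host (modprobe -c prints the merged configuration):
#   modprobe -c | configured_zcopy
```

An "unset" result means the module falls back to its built-in default, which for this kernel is the enabled setting described in the commit.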

I'll make a few more attempts at reproducing this issue and keep you posted.

Thank you very much,

David Hill



