Re: Guest shutdown hang host machine.

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 22 Jan 2019 17:01:09 -0700

On Wed, 23 Jan 2019 01:16:03 +0200
Alex K <blessnown@xxxxxxxxx> wrote:

> Hello David,
> 
> The problem is that our host hangs for good, and only an IPMI reset helps in this case.
> We tried several kernels (upstream, mainline, old stable) and even tested other
> distributions, to no avail. Do you know if there is anything in particular I can look at
> in the debug logs to identify the problem more closely?

Since this is a server system, you can probably get a serial console,
which is much better for debugging these sorts of things.  It's not
clear whether your screenshots are a succession from a single error or
multiple instances of the error.  The first shows hard lockups followed
by soft lockups.  Soft lockups are not particularly surprising after a
hard lockup.  There are a few kernel options that might be helpful for
debugging:

        hardlockup_all_cpu_backtrace=
                        [KNL] Should the hard-lockup detector generate
                        backtraces on all cpus.
                        Format: <integer>

        nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
                        Format: [panic,][nopanic,][num]
                        Valid num: 0 or 1
                        0 - turn hardlockup detector in nmi_watchdog off
                        1 - turn hardlockup detector in nmi_watchdog on
                        When panic is specified, panic when an NMI watchdog
                        timeout occurs (or 'nopanic' to override the opposite
                        default). To disable both hard and soft lockup detectors,
                        please see 'nowatchdog'.
                        This is useful when you use a panic=... timeout and
                        need the box quickly up again.

                        These settings can be accessed at runtime via
                        the nmi_watchdog and hardlockup_panic sysctls.

        softlockup_panic=
                        [KNL] Should the soft-lockup detector generate panics.
                        Format: <integer>

                        A nonzero value instructs the soft-lockup detector
                        to panic the machine when a soft-lockup occurs. This
                        is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC
                        which is the respective build-time switch to that
                        functionality.

        softlockup_all_cpu_backtrace=
                        [KNL] Should the soft-lockup detector generate
                        backtraces on all cpus.
                        Format: <integer>

        panic=          [KNL] Kernel behaviour on panic: delay <timeout>
                        timeout > 0: seconds before rebooting
                        timeout = 0: wait forever
                        timeout < 0: reboot immediately
                        Format: <timeout>

I'd probably start with:

nmi_watchdog=panic panic=5

That way the initial hard lockup should generate a panic, and the host
should reboot 5 seconds later.  You'll really want a serial console to
capture that panic.  You'll probably also want to specify exactly which
kernel you're using; since this is an upstream list, the latest upstream
kernel is the most relevant.  Thanks,

Alex
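
P.S. A minimal sketch, assuming a POSIX shell, for confirming after a
reboot that the suggested parameters actually reached the kernel command
line.  The helper name check_cmdline is made up for illustration;
/proc/cmdline is where the running kernel exposes its boot command line.

```shell
#!/bin/sh
# check_cmdline CMDLINE PARAM...
# Hypothetical helper: print, for each PARAM, whether it appears as a
# whole space-separated token in CMDLINE.
check_cmdline() {
    cmdline=$1
    shift
    for want in "$@"; do
        # Pad with spaces so "panic=5" cannot match inside another token
        # such as "nmi_watchdog=panic".
        case " $cmdline " in
            *" $want "*) echo "present: $want" ;;
            *)           echo "missing: $want" ;;
        esac
    done
}

# Example: inspect the running kernel's command line.
# check_cmdline "$(cat /proc/cmdline)" nmi_watchdog=panic panic=5
```

On Ubuntu the parameters would typically be added to GRUB_CMDLINE_LINUX
in /etc/default/grub, followed by update-grub and a reboot.  A serial
console is usually requested with something like console=ttyS0,115200n8,
where the port and speed are assumptions that depend on the board.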

> On Sun, 20 Jan 2019 10:50:35 -0500
> David Hill <dhill@xxxxxxxxxx> wrote:
> 
> > I'm having a similar issue these days... host machine networking hangs 
> > for a little while when shutting down guests but connectivity resumes 
> > once the guests are down.
> > 
> > On 2019-01-10 7:11 a.m., Alex K wrote:  
> > > Good day!
> > >
> > > We are using Ubuntu + KVM (via qemu and libvirt) and we've run into some difficulties.
> > > Host OS:
> > > Distributor ID: Ubuntu
> > > Description: Ubuntu 18.04.1 LTS
> > > Release: 18.04
> > > Codename: bionic
> > > Spec server:
> > > Supermicro SYS-5017GR-TF
> > > CPU 1*Xeon E5-2690v2
> > > RAM 6*8GB ECC
> > > SSD 2*1TB Samsung EVO860
> > > GPU 2*GTX1070
> > > Network HP523SFP 10G
> > > PSU 1200W
> > > VM guest:
> > > vCPU 8 threads
> > > RAM 16GB
> > > GPU 1*GTX1070 (using vfio-pci)
> > > networking - vepa with macvtap
> > >
> > > When the guest receives the 'virsh destroy $vmname' command, the host machine hangs with
> > > "NMI Watchdog BUG: soft lockup - CPU # 2 stuck for 22s".
> > >
> > > How we tried to fix this issue:
> > >
> > >      Updated the BIOS to the latest version
> > >      Tried the previous distro version and alternatives (CentOS, Debian)
> > >      Replaced hardware components (CPU, RAM, network card, HDD and SSD)
> > >      Changed some BIOS settings:
> > >
> > >      Tried adjusting:
> > >      PERR# Generation
> > >      SERR# Generation
> > >      Above 4G Decoding
> > >      CPU max performance
> > >      Disabling internal NIC
> > >      Removing excessive boot devices
> > >      Power technology: various parameters
> > >      Energy/performance bias: various parameters
> > >      PCIe port: gen2 and gen3.
> > >
> > > Problems occur only when guests are going down and new ones are created.
> > >
> > > Attached files (console screenshots)
> > >
> > > https://drive.google.com/open?id=1AHCabWEy88A9pBK26D40LXFMWZMehejX
> > > https://drive.google.com/open?id=17VvCPVSWOJjEm-7nF9mrGNmd7vYNtR9x
> > > https://drive.google.com/open?id=1RBkBcCnTCRLGDwf59Lh8s0yeGvCiYhw7
> > >
> > >
> > >
> > > Regards, Alex.