On Wed, 23 Jan 2019 01:16:03 +0200
Alex K <blessnown@xxxxxxxxx> wrote:

> Hello David,
>
> Problem is our host hangs for good and only an IPMI reset can help in this case. We tried
> several kernels (upstream, mainline, old stable), even tested other distributions, to no
> avail. Do you know if I can look at anything particular in debug logs to more closely
> identify the problem?

Being a server system, you can probably get a serial console, which is
much better for debugging these sorts of things.  It's not clear if
your screenshots are a succession from a single error or multiple
instances of the error.  The first shows hard lockups followed by
soft lockups.  Soft lockups are not particularly surprising after a
hard lockup.  There are a few kernel options that might be helpful for
debugging:

        hardlockup_all_cpu_backtrace=
                        [KNL] Should the hard-lockup detector generate
                        backtraces on all cpus.
                        Format: <integer>

        nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
                        Format: [panic,][nopanic,][num]
                        Valid num: 0 or 1
                        0 - turn hardlockup detector in nmi_watchdog off
                        1 - turn hardlockup detector in nmi_watchdog on
                        When panic is specified, panic when an NMI watchdog
                        timeout occurs (or 'nopanic' to override the opposite
                        default). To disable both hard and soft lockup detectors,
                        please see 'nowatchdog'.
                        This is useful when you use a panic=... timeout and
                        need the box quickly up again.

                        These settings can be accessed at runtime via
                        the nmi_watchdog and hardlockup_panic sysctls.

        softlockup_panic=
                        [KNL] Should the soft-lockup detector generate panics.
                        Format: <integer>

                        A nonzero value instructs the soft-lockup detector
                        to panic the machine when a soft-lockup occurs. This
                        is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC
                        which is the respective build-time switch to that
                        functionality.

        softlockup_all_cpu_backtrace=
                        [KNL] Should the soft-lockup detector generate
                        backtraces on all cpus.
                        Format: <integer>

        panic=          [KNL] Kernel behaviour on panic: delay <timeout>
                        timeout > 0: seconds before rebooting
                        timeout = 0: wait forever
                        timeout < 0: reboot immediately
                        Format: <timeout>

I'd probably start with:

        nmi_watchdog=panic panic=5

so that the initial hard lockup generates a panic and the host reboots
after 5 seconds.  You'll really want a serial console to capture that
panic.  You'll probably also want to specify exactly which kernel you're
using, and this being an upstream list, the latest upstream kernel is
the most relevant.  Thanks,

Alex

> On Sun, 20 Jan 2019 10:50:35 -0500
> David Hill <dhill@xxxxxxxxxx> wrote:
>
> > I'm having a similar issue these days... host machine networking hangs
> > for a little while when shutting down guests but connectivity resumes
> > once the guests are down.
> >
> > On 2019-01-10 7:11 a.m., Alex K wrote:
> > > Good day!
> > >
> > > We are using Ubuntu + KVM (via qemu and libvirt) and we've run into some difficulties.
> > > Host OS:
> > >   Distributor ID: Ubuntu
> > >   Description: Ubuntu 18.04.1 LTS
> > >   Release: 18.04
> > >   Codename: bionic
> > > Spec server:
> > >   Supermicro SYS-5017GR-TF
> > >   CPU 1*Xeon E5-2690v2
> > >   RAM 6*8GB ECC
> > >   SSD 2*1TB Samsung EVO860
> > >   GPU 2*GTX1070
> > >   Network HP523SFP 10G
> > >   PSU 1200W
> > > VM guest:
> > >   vCPU 8 threads
> > >   RAM 16GB
> > >   GPU 1*GTX1070 (using vfio-pci)
> > >   networking - vepa with macvtap
> > >
> > > When the guest receives the 'virsh destroy $vmname' command, the host machine hangs with
> > > "NMI Watchdog BUG: soft lockup - CPU # 2 stuck for 22s".
> > >
> > > How we tried to fix this issue:
> > >
> > > Update BIOS to latest version
> > > Updating to previous distro version, using alternatives (centos, debian)
> > > Replace hardware components (CPU, RAM, Network Card, HDD and SSD)
> > > Change some BIOS settings:
> > >
> > > Tried adjusting:
> > >   PERR# Generation
> > >   SERR# Generation
> > >   Above 4G Decoding
> > >   Cpu max performance
> > >   Disabling internal NIC
> > >   Removing excessive boot devices
> > >   Power technology: various parameters
> > >   Energy/performance bias: various parameters
> > >   Pcie port: gen2 and gen3.
> > >
> > > Problems occur only when guests are going down and new ones are created.
> > >
> > > Attached files (console screenshots)
> > >
> > > https://drive.google.com/open?id=1AHCabWEy88A9pBK26D40LXFMWZMehejX
> > > https://drive.google.com/open?id=17VvCPVSWOJjEm-7nF9mrGNmd7vYNtR9x
> > > https://drive.google.com/open?id=1RBkBcCnTCRLGDwf59Lh8s0yeGvCiYhw7
> > >
> > >
> > >
> > > Regards, Alex.
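
P.S. Before trying to reproduce the hang, it may be worth confirming that
the options suggested in the reply (nmi_watchdog=panic panic=5) actually
took effect after the reboot.  The following is only a rough sketch of
such a check: the helper names are made up for illustration, it assumes
the usual procfs locations (/proc/cmdline and /proc/sys/kernel/), and it
splits the command line on whitespace, so quoted values containing
spaces are not handled.  It only reads files and changes nothing.

```python
"""Check whether the lockup-debugging options from the reply are active.

Assumptions (not from the original mail): all helper names here are
hypothetical, and the procfs paths are the usual Linux locations.
"""
from pathlib import Path

# Boot options suggested in the reply: nmi_watchdog=panic panic=5
WANTED = {"nmi_watchdog": "panic", "panic": "5"}

# Runtime sysctls under /proc/sys/kernel/ related to the options above.
SYSCTLS = ("nmi_watchdog", "hardlockup_panic", "softlockup_panic", "panic")


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_options(cmdline: str, wanted: dict = WANTED) -> dict:
    """Return the wanted options that are absent or set to another value."""
    params = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if params.get(k) != v}


def read_sysctls(names=SYSCTLS, root="/proc/sys/kernel") -> dict:
    """Read current sysctl values; None when a file is missing or unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(root) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("missing boot options:", missing_options(cmdline))
    print("runtime sysctls:", read_sysctls())
```

The comparison is an exact string match, so a variant such as
nmi_watchdog=panic,1 would be reported as differing even though it also
enables the panic behaviour; adjust WANTED to whatever you actually put
on the command line.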