Re: "BUG: soft lockup" and frozen guest

Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> · Tue, 30 Apr 2019 12:49:15 +0200

Christopher Pereira <kripper@xxxxxxxxxxxx> writes:

> On April 29, 2019 7:56:44 AM AST, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> 
> wrote:
>
>     Christopher Pereira <kripper@xxxxxxxxxxxx> writes:
>
>         Hi, I have been experiencing some random guest crashes in the
>         last years and would like to invest some time in trying to debug
>         them with your help. Symptom is: 1) "BUG: soft lockup" & "CPU#*
>         stuck for *s!" messages during high load on the guest 2) At some
>         point later (eg. 12 hours later), the guest just hangs without
>         any message and must be destroyed / rebooted. I attached the
>         relevant kernel messages. Host (spec: Intel(R) Xeon(R) CPU
>         E5645) is running: kernel-3.10.0-327.el7.x86_64
>         libvirt-daemon-kvm-1.2.17-13.el7_2.5.x86_64
>         qemu-kvm-ev-2.3.0-31.el7_2.10.1.x86_64
>         qemu-kvm-common-ev-2.3.0-31.el7_2.10.1.x86_64 
>
>
>     This is pretty old stuff, e.g. kernel-3.10.0-327.el7 was released with
>     RHEL-7.2 (Nov 2015). As this is an upstream mailing list, it would be great
>     if you could build an upstream kernel (should work with EL7 userspace)
>     and try to reproduce.
>
> Hi Vitaly,
>
> Yes, but it's a critical production environment and I haven't seen any 
> related patch in the kernel changelog since 3.10. We will try to upgrade 
> whenever possible.

It's hard to tell which changes may be related here (for example, I
also see nf_conntrack_* in your trace, and the bug may as well be there),
but even in the RHEL-7.2 updates (kernel-3.10.0-327.*) I can see several
dozen KVM commits (and there are several hundred between 7.2 and 7.6).

>
> I believe this bug could be related to overcommitting resources. Does 
> qemu-kvm throw any log message when resources are overcommitted? Is there 
> some way to enable this?
>
> We have seen this happening once in a while in the last 4 years on 
> different production hardware and wanted to ask if this is a common 
> issue and how to address/debug this issue.

Define "resources" and "overcommit" ;-) If you overcommit CPUs/memory
severely (like dozens/hundreds of vCPUs per pCPU, or a host that is
constantly swapping), guests may, of course, start to misbehave. If it
is just a couple of vCPUs per pCPU and the host is not swapping, guest
soft lockups are not normal.
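
To get a rough idea of where a host stands, something like the sketch
below may help. The function names are made up for this mail, and the
vCPU total has to be supplied by you (for instance by summing what
'virsh vcpucount' reports for each running guest). The second helper
reads the pswpin/pswpout counters from /proc/vmstat; run it twice a
minute apart, and if the numbers keep growing the host is swapping.

```shell
#!/bin/sh
# Rough host overcommit check. Helper names are hypothetical.

# vcpu_ratio VCPUS PCPUS: print vCPUs per pCPU with one decimal.
vcpu_ratio() {
    awk -v v="$1" -v p="$2" 'BEGIN { printf "%.1f\n", v / p }'
}

# swap_counters [FILE]: print "pswpin pswpout", i.e. the number of
# pages swapped in/out since boot. FILE defaults to /proc/vmstat.
swap_counters() {
    awk '$1 == "pswpin"  { i = $2 }
         $1 == "pswpout" { o = $2 }
         END { print i, o }' "${1:-/proc/vmstat}"
}

# Example: 48 vCPUs spread over 12 host threads.
vcpu_ratio 48 12    # prints 4.0
```

A ratio of a few vCPUs per pCPU with flat swap counters would point
away from overcommit as the cause.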

If there's no way to trigger the issue on demand, you may want to
enable kdump and set

sysctl -w kernel.softlockup_panic=1
sysctl -w kernel.softlockup_all_cpu_backtrace=1

and then inspect the crash dump.
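
Since the hang may only show up after a long time, it is probably
worth making these settings survive a reboot. A minimal sketch,
assuming a sysctl.d drop-in is acceptable on your system (the file
name below is only a suggestion):

```
# /etc/sysctl.d/99-softlockup.conf  (hypothetical file name)
# Panic on a soft lockup so that kdump can capture a vmcore.
kernel.softlockup_panic = 1
# Dump backtraces from all CPUs, not only the stuck one.
kernel.softlockup_all_cpu_backtrace = 1
```

'sysctl --system' should load it without rebooting.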

-- 
Vitaly