On April 29, 2019 7:56:44 AM AST, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
Christopher Pereira <kripper@xxxxxxxxxxxx> writes:
Hi,

I have been experiencing some random guest crashes over the last few years and would like to invest some time in trying to debug them with your help.

Symptoms:
1) "BUG: soft lockup" & "CPU#* stuck for *s!" messages during high load on the guest
2) At some point later (e.g. 12 hours later), the guest just hangs without any message and must be destroyed / rebooted.

I attached the relevant kernel messages.

Host (spec: Intel(R) Xeon(R) CPU E5645) is running:
kernel-3.10.0-327.el7.x86_64
libvirt-daemon-kvm-1.2.17-13.el7_2.5.x86_64
qemu-kvm-ev-2.3.0-31.el7_2.10.1.x86_64
qemu-kvm-common-ev-2.3.0-31.el7_2.10.1.x86_64

This is pretty old stuff, e.g. kernel-3.10.0-327.el7 was released with
RHEL-7.2 (Nov 2015). As this is an upstream mailing list, it would be great
if you could build an upstream kernel (it should work with EL7 userspace)
and try to reproduce.

Hi Vitaly,

Yes, but it's a critical production environment and I haven't seen any
related patches in the kernel changelog since 3.10. We will try to upgrade
whenever possible.

I believe this bug could be related to overcommitting resources. Does
qemu-kvm log any message when resources are overcommitted? Is there
some way to enable this?

We have seen this happen once in a while over the last 4 years on
different production hardware and wanted to ask whether this is a common
issue and how to address/debug it.
Best regards.