Re: BUG: soft lockup - CPU#0 stuck for 26s! with nested KVM on 5.19.x

On 9/8/22 17:08, Sean Christopherson wrote:
On Thu, Sep 08, 2022, František Šumšal wrote:
On 9/7/22 17:23, Sean Christopherson wrote:
On Wed, Sep 07, 2022, František Šumšal wrote:
On 9/7/22 17:08, Sean Christopherson wrote:
On Wed, Sep 07, 2022, František Šumšal wrote:
Hello!

In the Arch Linux part of our upstream systemd CI I recently noticed an
uptrend in CPU soft lockups when running one of our tests. This test runs
several systemd-nspawn containers in succession, and sometimes the underlying
VM locks up with a CPU soft lockup.

By "underlying VM", do you mean L1 or L2?  Where

       L0 == Bare Metal
       L1 == Arch Linux (KVM, 5.19.5-arch1-1/5.19.7-arch1-1)
       L2 == Arch Linux (nested KVM or QEMU TCG, 5.19.5-arch1-1/5.19.7-arch1-1)

I mean L2.

Is there anything interesting in the L1 or L0 logs?  A failure in a lower level
can manifest as a soft lockup and/or stall in the VM, e.g. because a vCPU isn't
run by the host for whatever reason.

There's nothing (quite literally) in the L0 logs; the host is silent while running the VM/tests.
As for L1, there doesn't seem to be anything interesting there either. Here are the L1 and L2 logs
for reference: https://mrc0mmand.fedorapeople.org/kernel-kvm-soft-lockup/2022-09-07-logs/


Does the bug repro with an older version of QEMU?  If it's difficult to roll back
the QEMU version, then we can punt on this question for now.


Is it possible to run the nspawn tests in L1?  If the bug repros there, that would
greatly shrink the size of the haystack.

I've fiddled around with the test and managed to trim it down enough that it's easy to run in both
L1 and L2, and after a couple of hours I managed to reproduce the lockup in both layers. That also
somewhat answers the QEMU question, since L0 uses QEMU 6.2.0 to run L1, and L1 uses QEMU 7.0.0 to
run L2. In both cases I used TCG emulation, since the issue appears to reproduce slightly more
often with it (or maybe I was just unlucky with KVM).
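
(For concreteness, by "TCG emulation" I mean forcing QEMU's software emulation instead of KVM
acceleration. A minimal sketch, assuming a plain qemu-system-x86_64 invocation rather than the
actual CI wrapper, and a hypothetical arch.img disk image:

    # software emulation (TCG), no KVM involved at this level
    qemu-system-x86_64 -accel tcg -smp 2 -m 4G -drive file=arch.img,format=raw ...

    # hardware-assisted run (KVM), for comparison
    qemu-system-x86_64 -accel kvm -smp 2 -m 4G -drive file=arch.img,format=raw ...

Everything else on the command line stays the same; only the accelerator selection differs.)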

https://mrc0mmand.fedorapeople.org/kernel-kvm-soft-lockup/2022-09-07-logs-no-L2/L1_console.log

As in the previous case, there's nothing of interest in the L0 logs.

This also raises a question: is this issue still KVM-related, given that in the last case there's
just L0 bare metal and L1 QEMU/TCG, with no KVM involved?

Ya. Unless there's a latent bug in KVM in your L0 kernel (extremely unlikely, given that the bug
repros with 4.18 and 5.17 as the bare metal kernel), this isn't a KVM problem.

The mm, ext4, and scheduler subsystems are all likely candidates.  I'm not familiar
enough with their gory details to point fingers though.

Do you think it's possible to bisect the L1 kernel using the QEMU/TCG configuration?
That'd probably be the least awful way to get a root cause.

Yeah, I can try, but it might take some time. Nevertheless, as you said, it sounds like the least
awful way to debug this further. I'll report back if/when I find something interesting.
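
For the record, my rough plan, assuming an older mainline release such as v5.18 turns out to be
good under the same QEMU/TCG setup (that still needs to be verified first):

    git bisect start
    git bisect bad v5.19      # 5.19.x is where the soft lockups show up
    git bisect good v5.18     # assumed good, to be confirmed before starting
    # then, for each step: build the kernel, boot it as L1 under QEMU/TCG, run the
    # trimmed-down nspawn test, and mark the result with `git bisect good` or
    # `git bisect bad` until it converges on a single commit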

Thanks for the tips!

--
PGP Key ID: 0xFB738CE27B634E4B


