Re: BUG: soft lockup in __kmalloc_node() with KFENCE enabled

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 11 Oct 2021 at 09:10, Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote:
>
> On Mon, Oct 11, 2021 at 08:48:29AM +0200, Marco Elver wrote:
> > On Mon, 11 Oct 2021 at 08:32, Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote:
> > > On Mon, Oct 11, 2021 at 08:00:00AM +0200, Marco Elver wrote:
> > > > On Sun, 10 Oct 2021 at 15:53, Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote:
> > > > > I can systematically reproduce the following soft lockup w/ the latest
> > > > > 5.15-rc4 kernel (and all the 5.14, 5.13 and 5.12 kernels that I've
> > > > > tested so far).
> > > > >
> > > > > I've found this issue by running systemd autopkgtest (I'm using the
> > > > > latest systemd in Ubuntu - 248.3-1ubuntu7 - but it should happen with
> > > > > any recent version of systemd).
> > > > >
> > > > > I'm running this test inside a local KVM instance and apparently systemd
> > > > > is starting up its own KVM instances to run its tests, so the context is
> > > > > a nested KVM scenario (even if I don't think the nested KVM part really
> > > > > matters).
> > > > >
> > > > > Here's the oops:
> > > > >
> > > > > [   36.466565] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [udevadm:333]
> > > > > [   36.466565] Modules linked in: btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear psmouse floppy
> > > > > [   36.466565] CPU: 0 PID: 333 Comm: udevadm Not tainted 5.15-rc4
> > > > > [   36.466565] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > > > [...]
> > > > >
> > > > > If I disable CONFIG_KFENCE the soft lockup doesn't happen and systemd
> > > > > autotest completes just fine.
> > > > >
> > > > > We've decided to disable KFENCE in the latest Ubuntu Impish kernel
> > > > > (5.13) for now, because of this issue, but I'm still investigating
> > > > > trying to better understand the problem.
> > > > >
> > > > > Any hint / suggestion?
> > > >
> > > > Can you confirm this is not a QEMU TCG instance? There's been a known
> > > > issue with it: https://bugs.launchpad.net/qemu/+bug/1920934
> > >
> > > It looks like systemd is running qemu-system-x86 without any "accel"
> > > options, so IIUC the instance shouldn't use TCG. Is this a correct
> > > assumption or is there a better way to check?
> >
> > AFAIK, the default is TCG if nothing else is requested. What was the
> > command line?
>
> This is the full command line of what systemd is running:
>
>   /bin/qemu-system-x86_64 -smp 4 -net none -m 512M -nographic -vga none -kernel /boot/vmlinuz-5.15-rc4 -drive format=raw,cache=unsafe,file=/var/tmp/systemd-test.sI1nrh/badid.img -initrd /boot/initrd.img-5.15-rc4 -append  root=/dev/sda1 rw raid=noautodetect rd.luks=0 loglevel=2 init=/lib/systemd/systemd console=ttyS0 selinux=0  SYSTEMD_UNIT_PATH=/usr/lib/systemd/tests/testdata/testsuite-14.units:/usr/lib/systemd/tests/testdata/units: systemd.unit=testsuite.target systemd.wants=testsuite-14.service systemd.wants=end.service
>
> And this is running inside a KVM instance (so a nested KVM scenario).

Hi Andrea,

I think you need to pass -enable-kvm to make it "nested KVM scenario",
otherwise it's TCG emulation.

You seem to use the default 20s stall timeout. FWIW syzbot uses 160
secs timeout for TCG emulation to avoid false positive warnings:
https://github.com/google/syzkaller/blob/838e7e2cd9228583ca33c49a39aea4d863d3e36d/dashboard/config/linux/upstream-arm64-kasan.config#L509
There are a number of other timeouts raised as well, some as high as
420 seconds.





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux