Re: BUG: soft lockup in __kmalloc_node() with KFENCE enabled

On Sun, 10 Oct 2021 at 15:53, Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote:
> I can systematically reproduce the following soft lockup w/ the latest
> 5.15-rc4 kernel (and all the 5.14, 5.13 and 5.12 kernels that I've
> tested so far).
>
> I've found this issue by running systemd autopkgtest (I'm using the
> latest systemd in Ubuntu - 248.3-1ubuntu7 - but it should happen with
> any recent version of systemd).
>
> I'm running this test inside a local KVM instance and apparently systemd
> is starting up its own KVM instances to run its tests, so the context is
> a nested KVM scenario (even if I don't think the nested KVM part really
> matters).
>
> Here's the oops:
>
> [   36.466565] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [udevadm:333]
> [   36.466565] Modules linked in: btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear psmouse floppy
> [   36.466565] CPU: 0 PID: 333 Comm: udevadm Not tainted 5.15-rc4
> [   36.466565] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[...]
>
> If I disable CONFIG_KFENCE the soft lockup doesn't happen and systemd
> autotest completes just fine.
>
> We've decided to disable KFENCE in the latest Ubuntu Impish kernel
> (5.13) for now, because of this issue, but I'm still investigating
> trying to better understand the problem.
>
> Any hint / suggestion?

Can you confirm this is not a QEMU TCG instance? There's been a known
issue with it: https://bugs.launchpad.net/qemu/+bug/1920934

One thing I've been wondering is whether we should make
CONFIG_KFENCE_STATIC_KEYS=n the default, because the static keys
approach is becoming more trouble than it's worth. Changing it upstream
would require us to re-benchmark the defaults first. If you're thinking
of turning KFENCE on by default (i.e. CONFIG_KFENCE_SAMPLE_INTERVAL
non-zero), you could already make this decision for Ubuntu, with
whatever sample interval you choose. We've found that for large
deployments a sample interval of 500ms or above is more than adequate.
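
For illustration, a config fragment along these lines would match the
suggestion above. It is only a sketch: the value 500 is simply the lower
end of the range mentioned, not a tested Ubuntu setting, and the right
interval depends on your deployment.

```
# Sketch only; 500 is an example value, not a benchmarked default.
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=500
# CONFIG_KFENCE_STATIC_KEYS is not set
```

The interval can also be overridden at boot with
kfence.sample_interval=<ms>, so the built-in value need not be final.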

Thanks,
-- Marco



