> On Jan 22, 2018, at 18:34, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 1/22/18 4:31 PM, David Zarzycki wrote:
>> Hello,
>>
>> I previously reported a hang when building LLVM+clang on a block multi-queue device (NVMe _or_ loopback onto tmpfs with the ’none’ scheduler).
>>
>> I’ve since updated the kernel to 4.15-rc9, merged the ‘blkmq/for-next’ branch, disabled nohz_full parameter (used for testing), and tried again. Both NVMe and loopback now lock up hard (ext4 if it matters). Here are the backtraces:
>>
>> NVMe: http://znu.io/IMG_0366.jpg
>> Loopback: http://znu.io/IMG_0367.jpg
>
> I tried to reproduce this today using the exact recipe that you provide,
> but it ran fine for hours. Similar setup, nvme on a dual socket box
> with 48 threads.

Hi Jens,

Thanks for the quick reply and thanks for trying to reproduce this. I’m not sure if this makes a difference, but this dual Skylake machine has 96 threads, not 48 threads. Also, just to be clear, NVMe doesn’t seem to matter. I hit this bug with a tmpfs loopback device set up like so:

dd if=/dev/zero bs=1024k count=10000 of=/tmp/loopdisk
losetup /dev/loop0 /tmp/loopdisk
echo none > /sys/block/loop0/queue/scheduler
mkfs -t ext4 -L loopy /dev/loop0
mount /dev/loop0 /l
### build LLVM+clang in /l
### 'ninja check-all' in a loop in /l

(No swap is set up because the machine has 192 GiB of RAM.)

>
>> What should I try next to help debug this?
>
> This one looks different than the other one. Are you sure your hw is sane?

I can build LLVM+clang in /tmp (tmpfs) reliably, which suggests that the fundamental hardware is sane. It’s only when the software multi-queue layer gets involved that I see quick crashes/hangs.

As for the different backtraces, that's probably because I removed nohz_full from the kernel boot parameters.

> I'd probably try and enable lockdep debugging etc and see if you catch anything.

Thanks. I turned on lockdep plus other lock debugging.
Here is the resulting backtrace:

http://znu.io/IMG_0368.jpg

Here is the resulting backtrace with transparent huge pages disabled:

http://znu.io/IMG_0369.jpg

Here is the resulting backtrace with transparent huge pages disabled AND with systemd-coredumps disabled too:

http://znu.io/IMG_0370.jpg

I’m open to trying anything at this point.

Thanks for helping,
Dave