Re: Hard LOCKUP on 4.15-rc9 + 'blkmq/for-next' branch

> On Jan 22, 2018, at 18:34, Jens Axboe <axboe@xxxxxxxxx> wrote:
> 
> On 1/22/18 4:31 PM, David Zarzycki wrote:
>> Hello,
>> 
>> I previously reported a hang when building LLVM+clang on a block multi-queue device (NVMe _or_ loopback onto tmpfs with the ’none’ scheduler).
>> 
>> I’ve since updated the kernel to 4.15-rc9, merged the ‘blkmq/for-next’ branch, disabled nohz_full parameter (used for testing), and tried again. Both NVMe and loopback now lock up hard (ext4 if it matters). Here are the backtraces:
>> 
>> NVMe:      http://znu.io/IMG_0366.jpg
>> Loopback:  http://znu.io/IMG_0367.jpg
> 
> I tried to reproduce this today using the exact recipe that you provide,
> but it ran fine for hours. Similar setup, nvme on a dual socket box
> with 48 threads.

Hi Jens,

Thanks for the quick reply and thanks for trying to reproduce this. I’m not sure if this makes a difference, but this dual Skylake machine has 96 threads, not 48 threads. Also, just to be clear, NVMe doesn’t seem to matter. I hit this bug with a tmpfs loopback device set up like so:

dd if=/dev/zero bs=1024k count=10000 of=/tmp/loopdisk
losetup /dev/loop0 /tmp/loopdisk
echo none > /sys/block/loop0/queue/scheduler
mkfs -t ext4 -L loopy /dev/loop0
mount /dev/loop0 /l
### build LLVM+clang in /l
### 'ninja check-all’ in a loop in /l

(No swap is set up because the machine has 192 GiB of RAM.)
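
The scheduler check and the "in a loop" step of the recipe could be scripted roughly as follows. This is only a sketch: `active_sched` and `run_until_failure` are hypothetical helper names, not part of the original report, and the sample scheduler line in the comment is an assumed example of the sysfs format.

```shell
# active_sched LINE: print the bracketed (active) entry of a scheduler
# line as read from /sys/block/<dev>/queue/scheduler, e.g.
# "[none] mq-deadline kyber" (assumed sample content).
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# run_until_failure CMD [ARGS...]: rerun CMD until it exits non-zero,
# then print how many runs succeeded.
run_until_failure() {
    n=0
    while "$@"; do
        n=$((n + 1))
    done
    echo "$n"
}
```

With these defined, `active_sched "$(cat /sys/block/loop0/queue/scheduler)"` should print `none` once the scheduler has been set as in the recipe, and `run_until_failure ninja check-all`, run from the build directory, repeats the test suite until a run fails. A hard lockup would of course stop the loop without any report.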

> 
>> What should I try next to help debug this?
> 
> This one looks different than the other one. Are you sure your hw is sane?

I can build LLVM+clang in /tmp (tmpfs) reliably, which suggests that the underlying hardware is sane. It’s only when the software multi-queue layer gets involved that I see quick crashes/hangs.

As for the different backtraces, that's probably because I removed nohz_full from the kernel boot parameters.

> I'd probably try and enable lockdep debugging etc and see if you catch anything.

Thanks. I turned on lockdep plus other lock debugging. Here is the resulting backtrace:

http://znu.io/IMG_0368.jpg

Here is the resulting backtrace with transparent huge pages disabled:

http://znu.io/IMG_0369.jpg

Here is the resulting backtrace with both transparent huge pages and systemd-coredump disabled:

http://znu.io/IMG_0370.jpg

I’m open to trying anything at this point. Thanks for helping,
Dave


