lk 3.17-rc4 blk_mq large write problems

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Tue, 09 Sep 2014 23:55:02 -0400

A few days ago I was trying to create a large file
(say 16 GB) of zeros on an ext4 file system:
   dd if=/dev/zero bs=64k count=256k of=zero_16g.bin

After about 5 seconds there was a NULL pointer dereference
that crashed the machine (shown below). This was with a clean
version of lk 3.17-rc4 (from kernel.org) where the target
was a SATA SSD directly connected to an LSI 9300-4i SAS-3
HBA (mpt3sas). Significantly (IMO) the kernel boot line
contained:
   scsi_mod.use_blk_mq=Y

In all cases changing that to "N" fixed the problem. I tried
many things, including a SAS SSD, but the problem persisted
with use_blk_mq=Y. It doesn't always oops as shown in the
first case below. There were also:
  - immediate reboots
  - lock-ups without any oops on the console
  - different oopses of a somewhat stranger nature, like
    double bus errors (hard to catch as logging everything
    on a real serial port is fiddly)

Rob Elliott has been unable to replicate this problem.

Today I switched to another machine running Debian 7 (the
first machine was Ubuntu 14.04 based); both x86_64.
Built the same kernel on the second machine, this time
with an LSI 9212-4i4e SAS-2 HBA (mpt2sas) and a SAS SSD
directly connected. Roughly speaking it was the same
test case:
  # <create 1 partition on say /dev/sdb>
  # mkfs.ext4 /dev/sdb1
  # mount /dev/sdb1 /mnt/spare
  # cd /mnt/spare
  # dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
  # cd
  # umount /mnt/spare
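
For anyone wanting to repeat this, the steps above can be
wrapped in a small script. This is only a sketch: PART, MNT,
DRY_RUN, run() and big_write_test() are names made up for the
example, and it writes to an absolute path instead of doing
the cd. It defaults to a dry run that just prints the
commands, because mkfs.ext4 destroys whatever is on the
partition.

```shell
#!/bin/sh
# Sketch of the test case above. DESTRUCTIVE when DRY_RUN=0:
# mkfs.ext4 wipes the partition named by PART.
# PART, MNT and DRY_RUN are hypothetical names for this example.
PART=${PART:-/dev/sdb1}
MNT=${MNT:-/mnt/spare}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

big_write_test() {
    # 64 KiB per block x 256 Ki blocks = 16 GiB of zeros
    echo "bytes to write: $((64 * 1024 * 256 * 1024))"
    run mkfs.ext4 "$PART" &&
    run mount "$PART" "$MNT" &&
    run dd if=/dev/zero bs=64k count=256k of="$MNT/zero_16g.bin" &&
    run umount "$MNT"
}

big_write_test
```

Run it as root with DRY_RUN=0 and PART set to a scratch
partition to do the real thing.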

Usually the dd or the umount would crash. Then after a
crash, following a power cycle, the mount would crash.
Changing to scsi_mod.use_blk_mq=N restored sanity.
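
When comparing runs it helps to confirm which setting a
booted kernel really has. A minimal sketch, assuming scsi_mod
exports the parameter under /sys/module/scsi_mod/parameters/
(blk_mq_state() is a name made up here; the two path
arguments exist only so it can be pointed at test files):

```shell
#!/bin/sh
# Report the scsi_mod.use_blk_mq setting from two places:
# the live module parameter and the kernel boot line.
# Both paths are overridable; the defaults are assumptions.
blk_mq_state() {
    param=${1:-/sys/module/scsi_mod/parameters/use_blk_mq}
    cmdline=${2:-/proc/cmdline}

    if [ -r "$param" ]; then
        # A bool module parameter reads back as Y or N.
        echo "runtime: $(cat "$param")"
    else
        echo "runtime: unknown"
    fi

    val=
    if [ -r "$cmdline" ]; then
        # Split the boot line into words, keep the last match.
        val=$(tr ' ' '\n' < "$cmdline" |
              sed -n 's/^scsi_mod\.use_blk_mq=//p' | tail -n 1)
    fi
    echo "cmdline: ${val:-unset}"
}

blk_mq_state "$@"
```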

Tried some other SAS controllers: couldn't get an MR-9240-4i
(MegaRAID) to work at all on my newer box (doesn't like
PCIe 3?). Got an ARC-1882I working and it did not have
problems with the big dd (perhaps the arcmsr driver still
uses the host_lock to serialize commands).

So it could be bad code common to the mpt2sas and mpt3sas
drivers. Or it could be somewhere else. Perhaps there is
more than one problem.

Testers out there are encouraged to run the above test case.
The SATA and SAS SSDs that I used can consume writes in the
300 to 600 MB/sec range.

Part of the strangeness of the first attached oops is that
blk_mq_timeout_check() appears twice in its call trace. The
second oops (typically from the umount) is a blown stack.
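
One possible reading of the repeated entry, offered as a
sketch and not a diagnosis: every line in that trace is
marked "?", and subtracting the offset from the address gives
each symbol's start. An entry whose offset equals the symbol
length (like +0x90/0x90) points one byte past that symbol,
which is where the next function begins. The helper below
does that subtraction; sym_starts() is a made-up name, and it
only looks at the low 32 bits, assuming every address shares
the ffffffff prefix.

```shell
#!/bin/sh
# For each "[<addr>] ? sym+0xoff/0xlen" line on stdin, print the
# symbol start (addr - off) and flag entries where off == len.
# Simplification: only the low 32 bits of each address are used.
sym_starts() {
    sed -n 's|^ *\[<[0-9a-f]\{8\}\([0-9a-f]\{8\}\)>] ? *\([^+ ]*\)+\(0x[0-9a-f]*\)/\(0x[0-9a-f]*\).*|\1 \2 \3 \4|p' |
    while read -r addr sym off len; do
        start=$(printf '%08x' $((0x$addr - $off)))
        note=
        # Offset and length are printed in the same format,
        # so a string comparison is enough here.
        [ "$off" = "$len" ] && note=" (addr is end of symbol)"
        echo "$sym start=$start addr=$addr$note"
    done
}
```

Used as "sym_starts < oops.txt" on the first trace, it gives
start=81190600 for blk_mq_rq_timer, which is the same address
as the blk_mq_timeout_check+0x90/0x90 entry, and
start=81190570 for blk_mq_timeout_check, which matches the
blk_mq_attempt_merge+0xb0/0xb0 entry. So those two lines may
be leftover function addresses on the stack and not extra
calls.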

Enjoy.
Doug Gilbert

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0
PGD 2149ec067 PUD 214265067 PMD 0 
Oops: 0000 [#1] SMP 
Modules linked in: x86_pkg_temp_thermal kvm_intel kvm nfsd ehci_pci ehci_hcd crct10dif_pclmul serio_raw parport_pc auth_rpcgss oid_registry exportfs nfs lockd sunrpc binfmt_misc fuse lp parport ext4 crc16 jbd2 usbhid ses xhci_hcd r8169 usbcore usb_common
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.17.0-rc3 #69
Hardware name: Gigabyte Technology Co., Ltd. Z97M-D3H/Z97M-D3H, BIOS F5 05/30/2014
task: ffff88021513e090 ti: ffff88021518c000 task.ti: ffff88021518c000
RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0
RSP: 0018:ffff88021fb83e10  EFLAGS: 00010282
RAX: ffffffff8127cd20 RBX: 0000000000000000 RCX: ffff8800d3dc8d40
RDX: ffff88020fe9c0c8 RSI: 0000000000002007 RDI: ffff8800d3dc8c00
RBP: ffff88020fe9c0c8 R08: ffff880037970088 R09: ffff880037970000
R10: ffff88021e8024e8 R11: 0000000000000002 R12: 0000000000000449
R13: ffff880037970000 R14: ffff88021fb83ea8 R15: ffff88021520c000
FS:  0000000000000000(0000) GS:ffff88021fb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000214321000 CR4: 00000000001407e0
Stack:
 ffff8800d3dc8c00 ffff88020fe9c0c8 ffffffff8118f1d7 00000000000026fb
 ffff88020fe9d400 ffffffff811905db ffff880214fb33c0 ffff880037970000
 ffffffff81190570 ffff88021fb83ea8 0000000000000020 ffffffff81193430
Call Trace:
 <IRQ> 
 [<ffffffff8118f1d7>] ? blk_rq_timed_out+0x17/0x80
 [<ffffffff811905db>] ? blk_mq_timeout_check+0x6b/0x90
 [<ffffffff81190570>] ? blk_mq_attempt_merge+0xb0/0xb0
 [<ffffffff81193430>] ? blk_mq_tag_busy_iter+0x50/0x80
 [<ffffffff81190684>] ? blk_mq_rq_timer+0x84/0x120
 [<ffffffff81190600>] ? blk_mq_timeout_check+0x90/0x90
 [<ffffffff81076ea2>] ? call_timer_fn.isra.36+0x12/0x70
 [<ffffffff8107709a>] ? run_timer_softirq+0x19a/0x230
 [<ffffffff8103d6e5>] ? __do_softirq+0xd5/0x1f0
 [<ffffffff8103d995>] ? irq_exit+0x45/0x50
 [<ffffffff8102a6bb>] ? smp_apic_timer_interrupt+0x3b/0x50
 [<ffffffff8140dc4a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI> 
 [<ffffffff81327deb>] ? cpuidle_enter_state+0x4b/0xc0
 [<ffffffff81327ddd>] ? cpuidle_enter_state+0x3d/0xc0
 [<ffffffff810655e7>] ? cpu_startup_entry+0x237/0x270
Code: e8 d8 b3 ff ff 85 c0 75 cd e9 54 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 be 07 20 00 00 53 48 8b 9f f8 00 00 00 <48> 8b 03 48 89 df 48 8b 28 e8 24 ac ff ff 83 bd 54 01 00 00 ff 
RIP  [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0
 RSP <ffff88021fb83e10>
CR2: 0000000000000000
---[ end trace 659752a390e3d62e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt

BUG: unable to handle kernel paging request at 000000017f6b91a0
IP: [<ffffffff8106ab1f>] cpuacct_charge+0x1f/0x40
PGD 3a77e067 PUD 0 
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP 
Modules linked in: fuse hfsplus hfs minix vfat msdos fat ext4 crc16 jbd2 nfsd auth_rpcgss oid_registry exportfs nfs lockd sunrpc usbhid ohci_pci ehci_pci ohci_hcd ehci_hcd parport_pc k8temp serio_raw parport usbcore usb_common sg mpt2sas sr_mod
CPU: 0 PID: 5005 Comm: mount Not tainted 3.17.0-rc4 #1
Hardware name: ASUSTek Computer INC. K8N-LR/K8N-LR, BIOS 0303 04/14/2006
task: ffff88003d354790 ti: ffff88003d338000 task.ti: ffff88003d338000
RIP: 0010:[<ffffffff8106ab1f>]  [<ffffffff8106ab1f>] cpuacct_charge+0x1f/0x40
RSP: 0018:ffff88003fc03e00  EFLAGS: 00010046
RAX: 000000000000cf28 RBX: ffff88003d3547f8 RCX: 000000003fc18c40
RDX: ffffffff815b5700 RSI: 0000000000047b86 RDI: ffff88003d354790
RBP: 0000000000047b86 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003fc117a0
R13: 00000004842567a4 R14: ffff88003d3547f8 R15: 00000016194179b7
FS:  00007f85e76367e0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000017f6b91a0 CR3: 0000000033d75000 CR4: 00000000000007f0
Stack:
 ffffffff8105fa1c ffff88003a71c170 ffff88003fc117a0 0000000000000000
 0000000000000000 ffff88003fc11740 ffffffff810618d5 00000000000007fe
 ffffffff8105e905 ffff88003fc12100 ffff88003fc11740 0000000000000000
Call Trace:
 <IRQ> 
 [<ffffffff8105fa1c>] ? update_curr+0x9c/0xf0
 [<ffffffff810618d5>] ? task_tick_fair+0x1f5/0x4c0
 [<ffffffff8105e905>] ? sched_clock_local+0x15/0x80
 [<ffffffff8105a614>] ? scheduler_tick+0x64/0xe0
 [<ffffffff8107cd48>] ? update_process_times+0x58/0x80
 [<ffffffff81089f6d>] ? tick_sched_timer+0x4d/0x150
 [<ffffffff8107d2d9>] ? __run_hrtimer.isra.35+0x49/0xd0
 [<ffffffff8107d907>] ? hrtimer_interrupt+0xf7/0x240
 [<ffffffff8102c2f6>] ? smp_apic_timer_interrupt+0x36/0x50
 [<ffffffff8142374a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI> 
Code: 48 c7 c0 f4 ff ff ff 5b eb d9 66 90 48 8b 47 08 48 63 48 18 48 8b 87 88 06 00 00 48 8b 50 60 0f 1f 44 00 00 48 8b 82 a8 00 00 00 <48> 03 04 cd a0 2f 5f 81 48 01 30 48 8b 52 40 48 85 d2 75 e5 c3 
RIP  [<ffffffff8106ab1f>] cpuacct_charge+0x1f/0x40
 RSP <ffff88003fc03e00>
CR2: 000000017f6b91a0
---[ end trace 18c8bb81a9313bee ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt