Re: LIO crashing Fedora box, multiple versions and kernels tested

Dan Lane <dracodan@xxxxxxxxx> · Fri, 10 Apr 2015 00:08:44 -0400

I'm starting to get a better idea of the trigger is for my problem...
Something during startup/shutdown of ESXi is doing something that
either causes or triggers the failure.  I had ran the storage for over
a day with a VM working flawlessly (it was using a single hard drive
as the back-end after my ramdisk wasn't big enough and didn't cause a
crash.  When I had presented the LUN the host was already running, and
everyone played happy.  but once I tried shutting the host down, the
storage server crashed... I was able to get MOST of the error message
as I was running a tailf of messages from ssh, but it's abruptly cut
off.  I plan to do further testing to see if I can figure out exactly
what is triggering the failure and get better logs (I'm open to
suggestions on this!).

Error message:
Apr 10 00:00:19 labsan2 kernel: [90003.576341] ------------[ cut here
]------------
Apr 10 00:00:19 labsan2 kernel: [90003.576341] WARNING: CPU: 1 PID: 0
at kernel/watchdog.c:317 watchdog_overflow_callback+0x82/0xc0()
Apr 10 00:00:19 labsan2 kernel: [90003.576341] Watchdog detected hard
LOCKUP on cpu 1
Apr 10 00:00:19 labsan2 kernel: [90003.576341] Modules linked in:
tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw coretemp gpio_ich iTCO_wdt
iTCO_vendor_support ipmi_devintf kvm_intel kvm lpc_ich mfd_core
i5000_edac edac_core ipmi_ssif serio_raw i5k_amb ipmi_si
ipmi_msghandler ioatdma shpchp dca acpi_cpufreq nfsd auth_rpcgss
nfs_acl lockd grace sunrpc xfs libcrc32c radeon i2c_algo_bit
drm_kms_helper mptsas ttm scsi_transport_sas drm mptscsih qla2xxx bnx2
mptbase usb_storage ata_generic pata_acpi scsi_transport_fc
Apr 10 00:00:19 labsan2 kernel: [90003.576341] CPU: 1 PID: 0 Comm:
swapper/1 Not tainted 4.0.0-0.rc2.git0.1.fc22.x86_64 #1
Apr 10 00:00:19 labsan2 kernel: [90003.576341] Hardware name: IBM IBM
eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS
-[BCE142BUS-1.18]- 06/17/2009
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
0f19f9d994cd1b0a ffff88042fc85a60 ffffffff81780388
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
ffff88042fc85ab8 ffff88042fc85aa0 ffffffff8109c83a
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
ffff88041d6d0000 0000000000000000 ffff88042fc85c00
Apr 10 00:00:19 labsan2 kernel: [90003.576341] Call Trace:
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  <NMI>
[<ffffffff81780388>] dump_stack+0x45/0x57
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8109c83a>]
warn_slowpath_common+0x8a/0xc0
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8109c8c5>]
warn_slowpath_fmt+0x55/0x70
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81153e92>]
watchdog_overflow_callback+0x82/0xc0
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8119599b>]
__perf_event_overflow+0x9b/0x250
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff811964b4>]
perf_event_overflow+0x14/0x20
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81034f22>]
intel_pmu_handle_irq+0x1d2/0x3e0
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8102c1db>]
perf_event_nmi_handler+0x2b/0x50
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81019148>]
nmi_handle+0x88/0x130
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff810196c2>]
default_do_nmi+0x42/0x110
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81019818>]
do_nmi+0x88/0xd0
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81788b01>]
end_repeat_nmi+0x1e/0x2e
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
_raw_spin_lock_irqsave+0x55/0x60
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
_raw_spin_lock_irqsave+0x55/0x60
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
_raw_spin_lock_irqsave+0x55/0x60
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  <<EOE>>  <IRQ>
[<ffffffffa00ffeb2>] qlt_fc_port_deleted+0x62/0xd0 [qla2xxx]
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00a2c43>]
qla2x00_mark_device_lost+0x153/0x2e0 [qla2xxx]
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00c2339>]
qla2x00_async_event+0xe19/0x1870 [qla2xxx]
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00c3901>]
qla24xx_intr_handler+0x1a1/0x2f0 [qla2xxx]
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff810f4c57>]
handle_irq_event_percpu+0x77/0x1a0
Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8

On Wed, Apr 8, 2015 at 12:36 AM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Tue, 2015-04-07 at 23:17 -0400, Dan Lane wrote:
>> I presented a new LUN (and no others) with emulate-caw set to 0 via
>> the following command:
>> echo 0 > /sys/kernel/config/target/core/iblock_1/test/attrib/emulate_caw
>>
>
> Ok, after a quick test run on my end, a v4.0-rc1 + QLA2532 <-> QLA2532
> configuration is able to survive ~3000 COMPARE_AND_WRITE miscompares
> generated by sg_compare_and_write to a RAMDISK_MCP.
>
> So I'm pretty certain that it's not a COMPARE_AND_WRITE specific issue.
>
>> Upon starting the first ESXi host after configuring the new LUN, I
>> started getting the following on the console every few seconds:
>>
>> Message from syslogd@labsan2 at Apr  7 23:00:29 ...
>>  kernel:[17424.117000] NMI watchdog: BUG: soft lockup - CPU#2 stuck
>> for 22s! [systemd-udevd:1782]
>>
>> after a few minutes of this (and while trying to get more logs) I
>> started getting the following call trace on the console:
>>
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff811d3d03>] handle_mm_fault+0xe43/0x1850
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff812686ae>] ? ep_scan_ready_list+0x20e/0x220
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff81268930>] ? ep_poll+0x240/0x440
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff81064d13>] __do_page_fault+0x193/0x450
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff81065001>] do_page_fault+0x31/0x70
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff817887c8>] page_fault+0x28/0x30
>> Apr  7 23:11:25 labsan2 kernel: [18080.117000] Code: 00 48 03 14 c5 c0 ff d0 81 48 89 de 48 89 df e8 6d 69 29 00 84 c0 74 13 44 89 e7 ff 15 50 9e b0 00 eb 08 66 0f 1f 44 00 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 7d e8 65 48 33 3c 25 28 00 00 00
>> Apr  7 23:11:26 labsan2 kernel: [18081.031504] ABORT_TASK: Found referenced qla2xxx task_tag: 1139688
>> Apr  7 23:11:26 labsan2 kernel: [18081.031506] ABORT_TASK: ref_tag: 1139688 already complete, skipping
>> Apr  7 23:11:26 labsan2 kernel: [18081.031508] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1139688
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-udevd:1782]
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Modules linked in:
>> tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iTCO_wdt iTCO_vendor_support gpio_ich coretemp lpc_ich mfd_core kvm_intel kvm ipmi_devintf ipmi_ssif serio_raw i5000_edac ioatdma dca shpchp edac_core ipmi_si i5k_amb ipmi_msghandler acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c radeon i2c_algo_bit drm_kms_helper qla2xxx ttm mptsas drm scsi_transport_sas mptscsih ata_generic pata_acpi mptbase bnx2 scsi_transport_fc
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] CPU: 2 PID: 1782 Comm: systemd-udevd Tainted: G        W    L  4.0.0-0.rc2.git0.1.fc22.x86_64 #1
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Hardware name: IBM IBM eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS -[BCE142BUS-1.18]- 06/17/2009
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] task: ffff8800c9f84f00 ti: ffff88012a344000 task.ti: ffff88012a344000
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RIP: 0010:[<ffffffff8111b5aa>]  [<ffffffff8111b5aa>] generic_exec_single+0xea/0x1a0
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RSP: 0000:ffff88012a347be8  EFLAGS: 00000202
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RAX: 00000000000008fb RBX: 0000000000000141 RCX: 0000000000000000
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RDX: 00000000000008fb RSI: 00000000000000fb RDI: 0000000000000282
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RBP: ffff88012a347c28 R08: 0000000000000001 R09: ffff880000000e18
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] R10: ffff880000000000 R11: 0000000000130000 R12: 000200da00000001
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] R13: 0000000000000000 R14: ffff88012ffe8d80 R15: 0000010000000000
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] FS: 00007f41505c5880(0000) GS:ffff88012fd00000(0000) knlGS:0000000000000000
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] CR2: 00007ffcc75c3fa8 CR3: 000000012a324000 CR4: 00000000000007e0
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Stack:
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  ffff88012a347c08 0000000000000000 ffffffff8106be30 ffff88012a347cb8
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  0000000000000003 00000000cfa942cb 0000000000000001 ffffffff8106be30
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  ffff88012a347c58 ffffffff8111b6c0 ffff88012a347c58 ffffffff8139ec2b
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Call Trace:
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106be30>] ? do_flush_tlb_all+0x50/0x50
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106be30>] ? do_flush_tlb_all+0x50/0x50
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8111b6c0>] smp_call_function_single+0x60/0xa0
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8139ec2b>] ? cpumask_next_and+0x3b/0x50
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106be30>] ? do_flush_tlb_all+0x50/0x50
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8111bce3>] smp_call_function_many+0x233/0x280
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106c048>] native_flush_tlb_others+0xb8/0xc0
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106c2e4>] flush_tlb_page+0x54/0x90
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff811e556e>] ptep_clear_flush+0x4e/0x60
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff811d2333>] do_wp_page+0x333/0x940
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff811d3d03>] handle_mm_fault+0xe43/0x1850
>> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff812686ae>] ? ep_scan_ready_list+0x20e/0x220
>>
>
> Mmm, without the complete log it's hard to tell what's actually going
> on.
>
> So at this point I'm leaning towards the initial trigger being that your
> backend LLD driver is not returning I/Os before ESX's ABORT_TASK timer
> fires.  IIRC, ESX's qla2xxx timeout is 20 seconds.
>
> This typically means that something outside of the target is happening,
> that results in outstanding LLD I/Os not being completed fast enough, or
> not being completed at all back to the target backend driver. (IBLOCK in
> your case)
>
> It's a bit unusual that you've got two different machines with two
> different HBAs (aacraid + mptsas..?) displaying similar behavior..
>
> So I'd recommend first trying to eliminate the LLDs driver(s) in
> question, and verify using a ramdisk that's large enough to hold a
> working VMFS and see you are still able to reproduce.
>
> From there, I'd make sure your using the very latest firmware on the
> HBAs in question, and re-verify with the 'supported' vendor drivers from
> LSI and adaptec.
>
> --nab
>
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html