Re: LIO crashing Fedora box, multiple versions and kernels tested

On Tue, 2015-04-07 at 23:17 -0400, Dan Lane wrote:
> I presented a new LUN (and no others) with emulate-caw set to 0 via
> the following command:
> echo 0 > /sys/kernel/config/target/core/iblock_1/test/attrib/emulate_caw
> 

Ok, after a quick test run on my end, a v4.0-rc1 + QLA2532 <-> QLA2532
configuration is able to survive ~3000 COMPARE_AND_WRITE miscompares
generated by sg_compare_and_write to a RAMDISK_MCP.
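
Something along these lines, e.g. -- rough sketch only, with /dev/sdX
and the LBA standing in for the RAMDISK_MCP LUN as seen from the
initiator side:

  # Buffer = 1 verify block + 1 write block; random verify data means
  # every COMPARE_AND_WRITE comes back as a MISCOMPARE.
  dd if=/dev/urandom of=/tmp/caw_buf bs=512 count=2

  for i in $(seq 1 3000); do
      sg_compare_and_write --in=/tmp/caw_buf --lba=0 --num=1 /dev/sdX
  done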

So I'm pretty certain that it's not a COMPARE_AND_WRITE-specific issue.

> Upon starting the first ESXi host after configuring the new LUN, I
> started getting the following on the console every few seconds:
> 
> Message from syslogd@labsan2 at Apr  7 23:00:29 ...
>  kernel:[17424.117000] NMI watchdog: BUG: soft lockup - CPU#2 stuck
> for 22s! [systemd-udevd:1782]
> 
> after a few minutes of this (and while trying to get more logs) I
> started getting the following call trace on the console:
> 
> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff811d3d03>] handle_mm_fault+0xe43/0x1850
> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff812686ae>] ? ep_scan_ready_list+0x20e/0x220
> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff81268930>] ? ep_poll+0x240/0x440
> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff81064d13>] __do_page_fault+0x193/0x450
> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff81065001>] do_page_fault+0x31/0x70
> Apr  7 23:11:25 labsan2 kernel: [18080.117000]  [<ffffffff817887c8>] page_fault+0x28/0x30
> Apr  7 23:11:25 labsan2 kernel: [18080.117000] Code: 00 48 03 14 c5 c0 ff d0 81 48 89 de 48 89 df e8 6d 69 29 00 84 c0 74 13 44 89 e7 ff 15 50 9e b0 00 eb 08 66 0f 1f 44 00 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 7d e8 65 48 33 3c 25 28 00 00 00
> Apr  7 23:11:26 labsan2 kernel: [18081.031504] ABORT_TASK: Found referenced qla2xxx task_tag: 1139688
> Apr  7 23:11:26 labsan2 kernel: [18081.031506] ABORT_TASK: ref_tag: 1139688 already complete, skipping
> Apr  7 23:11:26 labsan2 kernel: [18081.031508] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1139688
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-udevd:1782]
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Modules linked in:
> tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iTCO_wdt iTCO_vendor_support gpio_ich coretemp lpc_ich mfd_core kvm_intel kvm ipmi_devintf ipmi_ssif serio_raw i5000_edac ioatdma dca shpchp edac_core ipmi_si i5k_amb ipmi_msghandler acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c radeon i2c_algo_bit drm_kms_helper qla2xxx ttm mptsas drm scsi_transport_sas mptscsih ata_generic pata_acpi mptbase bnx2 scsi_transport_fc
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] CPU: 2 PID: 1782 Comm: systemd-udevd Tainted: G        W    L  4.0.0-0.rc2.git0.1.fc22.x86_64 #1
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Hardware name: IBM IBM eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS -[BCE142BUS-1.18]- 06/17/2009
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] task: ffff8800c9f84f00 ti: ffff88012a344000 task.ti: ffff88012a344000
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RIP: 0010:[<ffffffff8111b5aa>]  [<ffffffff8111b5aa>] generic_exec_single+0xea/0x1a0
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RSP: 0000:ffff88012a347be8  EFLAGS: 00000202
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RAX: 00000000000008fb RBX: 0000000000000141 RCX: 0000000000000000
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RDX: 00000000000008fb RSI: 00000000000000fb RDI: 0000000000000282
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] RBP: ffff88012a347c28 R08: 0000000000000001 R09: ffff880000000e18
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] R10: ffff880000000000 R11: 0000000000130000 R12: 000200da00000001
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] R13: 0000000000000000 R14: ffff88012ffe8d80 R15: 0000010000000000
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] FS: 00007f41505c5880(0000) GS:ffff88012fd00000(0000) knlGS:0000000000000000
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] CR2: 00007ffcc75c3fa8 CR3: 000000012a324000 CR4: 00000000000007e0
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Stack:
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  ffff88012a347c08 0000000000000000 ffffffff8106be30 ffff88012a347cb8
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  0000000000000003 00000000cfa942cb 0000000000000001 ffffffff8106be30
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  ffff88012a347c58 ffffffff8111b6c0 ffff88012a347c58 ffffffff8139ec2b
> Apr  7 23:11:53 labsan2 kernel: [18108.117000] Call Trace:
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106be30>] ? do_flush_tlb_all+0x50/0x50
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106be30>] ? do_flush_tlb_all+0x50/0x50
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8111b6c0>] smp_call_function_single+0x60/0xa0
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8139ec2b>] ? cpumask_next_and+0x3b/0x50
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106be30>] ? do_flush_tlb_all+0x50/0x50
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8111bce3>] smp_call_function_many+0x233/0x280
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106c048>] native_flush_tlb_others+0xb8/0xc0
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff8106c2e4>] flush_tlb_page+0x54/0x90
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff811e556e>] ptep_clear_flush+0x4e/0x60
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff811d2333>] do_wp_page+0x333/0x940
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff811d3d03>] handle_mm_fault+0xe43/0x1850
> Apr  7 23:11:53 labsan2 kernel: [18108.117000]  [<ffffffff812686ae>] ? ep_scan_ready_list+0x20e/0x220
> 

Mmm, without the complete log it's hard to tell what's actually going
on.

So at this point I'm leaning towards the initial trigger being that your
backend LLD driver is not returning I/Os before ESX's ABORT_TASK timer
fires.  IIRC, ESX's qla2xxx timeout is 20 seconds.

This typically means that something outside of the target is happening
that results in outstanding LLD I/Os not being completed fast enough,
or not being completed at all, back to the target backend driver
(IBLOCK in your case).
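
One quick way to see whether requests really are getting stuck below
the target is to watch the in-flight counters on the backend block
device while the lockups are happening -- rough sketch, assuming
/dev/sdX is the IBLOCK backing device:

  # Outstanding reads/writes against the backing device; counts that
  # climb and never drain point at the LLD/HBA rather than the target.
  watch -n 1 cat /sys/block/sdX/inflight

  # Per-device latency and queue depth over time (sysstat package).
  iostat -x sdX 1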

It's a bit unusual that you've got two different machines with two
different HBAs (aacraid + mptsas?) displaying similar behavior.

So I'd recommend first trying to eliminate the LLD driver(s) in
question: verify using a ramdisk backstore that's large enough to hold
a working VMFS, and see if you're still able to reproduce.
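
If targetcli is handy, something like this should get you a
ramdisk-backed LUN for that test -- rough sketch; the backstore name,
size and WWN are placeholders, and the size just needs to be enough
for a minimal VMFS volume while still fitting in RAM:

  # RAMDISK_MCP backstore, so no LLD is involved at all.
  targetcli /backstores/ramdisk create name=rd_test size=2G

  # Export it through the existing tcm_qla2xxx target (use your WWPN).
  targetcli /qla2xxx/21:00:00:24:ff:xx:xx:xx/luns create /backstores/ramdisk/rd_test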

From there, I'd make sure you're using the very latest firmware on the
HBAs in question, and re-verify with the 'supported' vendor drivers
from LSI and Adaptec.
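
For the driver side it's usually worth capturing what you're running
now before flashing anything, e.g. (rough sketch):

  # Versions of the in-tree drivers currently loaded.
  modinfo mptsas aacraid | grep -i -e '^filename' -e '^version'

  # Firmware/driver banners the HBAs printed at probe time.
  dmesg | grep -i -e mptsas -e aacraid -e fusion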

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



