Re: [Open-FCoE] System crashes with increased drive count

Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> · Fri, 20 Jun 2014 11:29:17 -0700

On Thu, 2014-06-12 at 15:23 -0700, Jun Wu wrote:
> We tried the changes. The initiator was not able to see the drives on
> the target.
> I saw the following messages:
> 
> Jun 12 09:45:58 poc2 kernel: [ 1629.051837] scsi host7: libfc: Host
> reset failed, port (00061e) is not ready.
> Jun 12 09:45:58 poc2 kernel: [ 1629.051843] scsi 7:0:0:0: Device
> offlined - not ready after error recovery
> Jun 12 09:45:58 poc2 kernel: [ 1629.052155] general protection fault:
> 0000 [#1] SMP
> Jun 12 09:45:58 poc2 kernel: scsi host7: libfc: Host reset failed,
> port (00061e) is not ready.
> Jun 12 09:45:58 poc2 kernel: scsi 7:0:0:0: Device offlined - not ready
> after error recovery
> Jun 12 09:45:58 poc2 kernel: general protection fault: 0000 [#1] SMP
> Jun 12 09:45:58 poc2 kernel: [ 1629.052245] Modules linked in: tcm_fc
> target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
> target_core_mod ipt_MASQUERADE xt_CHECKSUM 8021q fcoe libfcoe garp mrp
> libfc scsi_transport_fc scsi_tgt ip6t_rpfilter ip6t_REJECT
> xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
> ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
> nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
> iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm
> crc32c_intel microcode serio_raw ses enclosure nfsd auth_rpcgss
> i7core_edac nfs_acl lockd edac_core sunrpc lpc_ich ioatdma mfd_core
> shpchp i2c_i801 acpi_cpufreq radeon drm_kms_helper ttm ixgbe igb drm
> mdio ata_generic ptp pata_acpi pps_core i2c_algo_bit pata_jmicron
> aacraid i2c_core dca
> Jun 12 09:45:58 poc2 kernel: [ 1629.053754] CPU: 13 PID: 2488 Comm:
> kworker/13:3 Tainted: GF O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
> Jun 12 09:45:58 poc2 kernel: [ 1629.053898] Hardware name: Supermicro
> X8DTN/X8DTN, BIOS 2.1c 10/28/2011
> Jun 12 09:45:58 poc2 kernel: [ 1629.054005] Workqueue: fc_wq_7
> fc_rport_final_delete [scsi_transport_fc]
> Jun 12 09:45:58 poc2 kernel: [ 1629.054106] task: ffff880622794500 ti:
> ffff880622600000 task.ti: ffff880622600000
> Jun 12 09:45:58 poc2 kernel: [ 1629.054213] RIP:
> 0010:[<ffffffff81434e29>] [<ffffffff81434e29>]
> scsi_device_put+0x19/0x50

It looks like the Scsi_Host has already been released while one of its
scsi devices is still being released here, whereas normally
fc_remove_host is called and only after that is the scsi host finally
released. This is likely due to a scsi device left over from pending
scan work, which was recently fixed by this patch from Neil:

http://patchwork.open-fcoe.org/patch/153/
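
One way to sanity-check the "host already released" reading is to look
at RAX in the two oops reports quoted below. Going by the Code: bytes in
the poc1 oops, the faulting instruction appears to be a load through
RAX, just after RAX was loaded from offset 0xc0 of the pointer stored at
the start of the scsi device; that reading of the bytes is an
interpretation, not something confirmed in this thread. On poc2, RAX is
6e696d7200276465, which is not a canonical x86-64 address and so fits
the general protection fault. A minimal Python sketch (the helper names
are hypothetical) shows that its bytes, in little-endian memory order,
are mostly printable ASCII, which would be consistent with the memory
having been freed and reused for string data:

```python
def register_as_memory_bytes(hex_value: str) -> bytes:
    """Return the 8 bytes of a 64-bit register value in little-endian memory order."""
    return int(hex_value, 16).to_bytes(8, "little")


def printable(raw: bytes) -> str:
    """Render bytes as ASCII, replacing non-printable bytes with '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# RAX values copied from the two oops reports quoted in this message.
rax_poc2 = register_as_memory_bytes("6e696d7200276465")
rax_poc1 = register_as_memory_bytes("0000000000000000")

print(printable(rax_poc2))  # ed'.rmin
print(printable(rax_poc1))  # ........
```

On poc1 RAX is zero, which matches the NULL pointer dereference reported
there. Both values are consistent with the host structure no longer
being valid, but neither proves it on its own.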

I triggered a host reset myself a few times and didn't run into this. I
also tried with DEBUG_OBJECTS_FREE enabled and saw no warnings with that
either.

[1277601.499732] host7: scsi: Resetting host
[1277601.500123] host7: lport 00a959: Entered RESET state from Ready state
[1277601.501756] host7: rport 00aaf4: Remove port
[1277601.502121] host7: rport 00aaf4: Port sending LOGO from Ready state
[1277601.503145] host7: fip: els_send op 9 d_id aaf4
[1277601.503744] host7: rport 00aaf4: Delete port
[1277601.504214] host7: rport 00aaf4: work event 3
[1277601.504489] host7: fcp: 00aaf4: Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
[1277601.504520] host7: xid  201: f_ctl  90000 seq  1
[1277601.504583] host7: rport 00aaf4: Received a LOGO response closed
[1277601.504583] host7: lport 00a959: Entered FLOGI state from reset state
[1277601.504583] host7: lport 00a959: Entered READY from state FLOGI
[1277601.504834] line:1378line:1378
[1277601.504834] scsi host7: libfc: Host reset succeeded on port (00a959)
[1277601.507137] host7: rport 00aaf4: callback ev 3
[1277601.508396] host7: fip: vn_rport_callback aaf4 event 3
[1277601.509136] host7: rport 00aaf4: work delete
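
To compare a good reset like this one against a failing run, the local
port state transitions can be pulled out of the libfc debug output. A
rough Python sketch, assuming only the two message shapes seen in this
log ("Entered X state from Y state" and "Entered X from state Y"); the
function name is hypothetical and the sample lines are copied from the
log above with the mail wrapping undone:

```python
import re

# Matches both "Entered RESET state from Ready state" and
# "Entered READY from state FLOGI" as they appear in this log.
LPORT_RE = re.compile(
    r"lport (\w+): Entered (\w+)(?: state)? from (?:state )?(\w+)"
)


def lport_transitions(lines):
    """Return (port_id, from_state, to_state) tuples, states upper-cased."""
    found = []
    for line in lines:
        m = LPORT_RE.search(line)
        if m:
            port, to_state, from_state = m.groups()
            found.append((port, from_state.upper(), to_state.upper()))
    return found


SAMPLE = [
    "[1277601.500123] host7: lport 00a959: Entered RESET state from Ready state",
    "[1277601.501756] host7: rport 00aaf4: Remove port",
    "[1277601.504583] host7: lport 00a959: Entered FLOGI state from reset state",
    "[1277601.504583] host7: lport 00a959: Entered READY from state FLOGI",
]

for port, src, dst in lport_transitions(SAMPLE):
    print(f"lport {port}: {src} -> {dst}")
```

For this log the sequence is READY, RESET, FLOGI, READY, that is, the
local port returns to READY before "Host reset succeeded" is printed.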

//Vasu

> Jun 12 09:45:58 poc2 kernel: [ 1629.054338] RSP: 0018:ffff880622601d90
> EFLAGS: 00010202
> Jun 12 09:45:58 poc2 kernel: [ 1629.054415] RAX: 6e696d7200276465 RBX:
> ffff880035e32000 RCX: 00000001820001a0
> Jun 12 09:45:58 poc2 kernel: [ 1629.054515] RDX: 00000001820001a1 RSI:
> 00000000820001a0 RDI: ffff880035e32000
> Jun 12 09:45:58 poc2 kernel: [ 1629.054615] RBP: ffff880622601da0 R08:
> 0000000000000000 R09: 0000000000000001
> Jun 12 09:45:58 poc2 kernel: [ 1629.054715] R10: ffffea0018b22600 R11:
> ffffffff81316701 R12: ffff880035e32000
> Jun 12 09:45:58 poc2 kernel: [ 1629.054815] R13: ffff88062dead010 R14:
> ffff88062dead000 R15: ffff880035870c00
> Jun 12 09:45:58 poc2 kernel: [ 1629.054916] FS: 0000000000000000(0000)
> GS:ffff88063fca0000(0000) knlGS:0000000000000000
> Jun 12 09:45:58 poc2 kernel: [ 1629.055031] CS: 0010 DS: 0000 ES: 0000
> CR0: 000000008005003b
> Jun 12 09:45:58 poc2 kernel: [ 1629.055111] CR2: 00007f63cf4d1a70 CR3:
> 0000000001c0c000 CR4: 00000000000007e0
> Jun 12 09:45:58 poc2 kernel: [ 1629.055211] Stack:
> Jun 12 09:45:58 poc2 kernel: [ 1629.055241] ffff880035e32000
> ffff880035955860 ffff880622601de8 ffffffff81443348
> Jun 12 09:45:58 poc2 kernel: [ 1629.055362] 0000000000000202
> ffff88062dead000 ffff88062dead000 ffff880035955c40
> Jun 12 09:45:58 poc2 kernel: [ 1629.055483] ffff880035955860
> ffff880035955800 ffff88032e81a000 ffff880622601e20
> Jun 12 09:45:58 poc2 kernel: [ 1629.055604] Call Trace:
> Jun 12 09:45:58 poc2 kernel: [ 1629.055647] [<ffffffff81443348>]
> scsi_remove_target+0x168/0x210
> Jun 12 09:45:58 poc2 kernel: [ 1629.055737] [<ffffffffa0492e6c>]
> fc_rport_final_delete+0xac/0x1f0 [scsi_transport_fc]
> Jun 12 09:45:58 poc2 kernel: [ 1629.055855] [<ffffffff81087bf6>]
> process_one_work+0x176/0x430
> Jun 12 09:45:58 poc2 kernel: [ 1629.055941] [<ffffffff8108882b>]
> worker_thread+0x11b/0x3a0
> Jun 12 09:45:58 poc2 kernel: [ 1629.056022] [<ffffffff81088710>] ?
> rescuer_thread+0x350/0x350
> Jun 12 09:45:58 poc2 kernel: [ 1629.056110] [<ffffffff8108f2f2>]
> kthread+0xd2/0xf0
> Jun 12 09:45:58 poc2 kernel: [ 1629.056181] [<ffffffff8108f220>] ?
> insert_kthread_work+0x40/0x40
> Jun 12 09:45:58 poc2 kernel: [ 1629.056274] [<ffffffff81696dbc>]
> ret_from_fork+0x7c/0xb0
> Jun 12 09:45:58 poc2 kernel: [ 1629.056353] [<ffffffff8108f220>] ?
> insert_kthread_work+0x40/0x40
> 
> and on the other node:
> 
> Jun 12 09:46:21 poc1 kernel: [ 1667.721292] rport-7:0-0: blocked FC
> remote port time out: removing target and saving binding
> Jun 12 09:46:32 poc1 kernel: [ 1678.220757] scsi host7: libfc: Host
> reset succeeded on port (0003ec)
> Jun 12 09:46:42 poc1 kernel: [ 1688.229744] scsi host7: libfc: Host
> reset succeeded on port (0003ec)
> 
> Jun 12 09:46:52 poc1 kernel: [ 1698.238715] scsi 7:0:0:0: Device
> offlined - not ready after error recovery
> Jun 12 09:46:52 poc1 kernel: [ 1698.238916] BUG: unable to handle
> kernel NULL pointer dereference at (null)
> Jun 12 09:46:52 poc1 kernel: [ 1698.239043] IP: [<ffffffff81434e29>]
> scsi_device_put+0x19/0x50
> Jun 12 09:46:52 poc1 kernel: [ 1698.239138] PGD 0
> Jun 12 09:46:52 poc1 kernel: [ 1698.239171] Oops: 0000 [#1] SMP
> Jun 12 09:46:52 poc1 kernel: [ 1698.239227] Modules linked in: tcm_fc
> target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
> target_core_mod ipt_MASQUERADE xt_CHECKSUM 8021q fcoe garp libfcoe mrp
> libfc scsi_transport_fc scsi_tgt ip6t_rpfilter ip6t_REJECT
> xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
> ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
> nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
> iTCO_wdt gpio_ich iTCO_vendor_support coretemp kvm_intel kvm
> crc32c_intel microcode serio_raw i2c_i801 lpc_ich mfd_core ioatdma ses
> enclosure i7core_edac edac_core shpchp acpi_cpufreq nfsd auth_rpcgss
> nfs_acl lockd sunrpc radeon drm_kms_helper ttm ixgbe igb drm mdio
> ata_generic ptp pata_acpi pps_core i2c_algo_bit pata_jmicron aacraid
> dca i2c_core
> Jun 12 09:46:52 poc1 kernel: [ 1698.240740] CPU: 1 PID: 1155 Comm:
> kworker/1:2 Tainted: GF O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
> Jun 12 09:46:52 poc1 kernel: [ 1698.240879] Hardware name: Supermicro
> X8DTN/X8DTN, BIOS 2.1c 10/28/2011
> Jun 12 09:46:52 poc1 kernel: [ 1698.240984] Workqueue: fc_wq_7
> fc_starget_delete [scsi_transport_fc]
> Jun 12 09:46:52 poc1 kernel: [ 1698.241080] task: ffff88061c3e7020 ti:
> ffff88061cc60000 task.ti: ffff88061cc60000
> Jun 12 09:46:52 poc1 kernel: [ 1698.241186] RIP:
> 0010:[<ffffffff81434e29>] [<ffffffff81434e29>]
> scsi_device_put+0x19/0x50
> Jun 12 09:46:52 poc1 kernel: [ 1698.241309] RSP: 0018:ffff88061cc61db0
> EFLAGS: 00010202
> Jun 12 09:46:52 poc1 kernel: [ 1698.241385] RAX: 0000000000000000 RBX:
> ffff88061d364800 RCX: 00000001820001e8
> Jun 12 09:46:52 poc1 kernel: [ 1698.241485] RDX: 00000001820001e9 RSI:
> 00000000820001e8 RDI: ffff88061d364800
> Jun 12 09:46:52 poc1 kernel: [ 1698.241585] RBP: ffff88061cc61dc0 R08:
> 0000000000000000 R09: 0000000000000001
> Jun 12 09:46:52 poc1 kernel: [ 1698.241685] R10: ffffea0018701740 R11:
> ffffffff81316701 R12: ffff88061d364800
> Jun 12 09:46:52 poc1 kernel: [ 1698.241785] R13: ffff88061e13e010 R14:
> ffff88061e13e000 R15: ffff88060b5f9400
> Jun 12 09:46:52 poc1 kernel: [ 1698.241886] FS: 0000000000000000(0000)
> GS:ffff880627c20000(0000) knlGS:0000000000000000
> Jun 12 09:46:52 poc1 kernel: [ 1698.242002] CS: 0010 DS: 0000 ES: 0000
> CR0: 000000008005003b
> Jun 12 09:46:52 poc1 kernel: [ 1698.242082] CR2: 0000000000000000 CR3:
> 0000000001c0c000 CR4: 00000000000007e0
> Jun 12 09:46:52 poc1 kernel: [ 1698.242182] Stack:
> Jun 12 09:46:52 poc1 kernel: [ 1698.242212] ffff88061d364800
> ffff8806160bb860 ffff88061cc61e08 ffffffff81443348
> Jun 12 09:46:52 poc1 kernel: [ 1698.242333] 0000000000000202
> ffff88061e13e000 ffff8806160bb800 ffff88061c3c3080
> Jun 12 09:46:52 poc1 kernel: [ 1698.242453] ffff880627c33e00
> ffffe8f9e7c22e00 0000000000000040 ffff88061cc61e20
> Jun 12 09:46:52 poc1 kernel: [ 1698.242574] Call Trace:
> Jun 12 09:46:52 poc1 kernel: [ 1698.242615] [<ffffffff81443348>]
> scsi_remove_target+0x168/0x210
> Jun 12 09:46:52 poc1 kernel: [ 1698.242706] [<ffffffffa05461b2>]
> fc_starget_delete+0x22/0x30 [scsi_transport_fc]
> Jun 12 09:46:52 poc1 kernel: [ 1698.242815] [<ffffffff81087bf6>]
> process_one_work+0x176/0x430
> Jun 12 09:46:52 poc1 kernel: [ 1698.242901] [<ffffffff8108882b>]
> worker_thread+0x11b/0x3a0
> Jun 12 09:46:52 poc1 kernel: [ 1698.242982] [<ffffffff81088710>] ?
> rescuer_thread+0x350/0x350
> Jun 12 09:46:52 poc1 kernel: [ 1698.243069] [<ffffffff8108f2f2>]
> kthread+0xd2/0xf0
> Jun 12 09:46:52 poc1 kernel: [ 1698.243140] [<ffffffff8108f220>] ?
> insert_kthread_work+0x40/0x40
> Jun 12 09:46:52 poc1 kernel: [ 1698.243231] [<ffffffff81696dbc>]
> ret_from_fork+0x7c/0xb0
> Jun 12 09:46:52 poc1 kernel: [ 1698.243310] [<ffffffff8108f220>] ?
> insert_kthread_work+0x40/0x40
> Jun 12 09:46:52 poc1 kernel: [ 1698.243397] Code: c4 08 48 89 d8 5b 41
> 5c 41 5d 41 5e 41 5f 5d c3 66 90 66 66 66 66 90 55 48 89 e5 41 54 49
> 89 fc 53 48 8b 07 48 8b 80 c0 00 00 00 <48> 8b 18 48 85 db 74 0d 48 89
> df e8 f7 72 ca ff 48 85 c0 75 12
> Jun 12 09:46:52 poc1 kernel: [ 1698.243942] RIP [<ffffffff81434e29>]
> scsi_device_put+0x19/0x50
> Jun 12 09:46:52 poc1 kernel: [ 1698.244032] RSP <ffff88061cc61db0>
> Jun 12 09:46:52 poc1 kernel: [ 1698.244081] CR2: 0000000000000000
> Jun 12 09:46:52 poc1 kernel: [ 1698.263415] ---[ end trace c6e30b4311d1a181 ]---
> Jun 12 09:46:52 poc1 kernel: [ 1698.263501] BUG: unable to handle
> kernel paging request at ffffffffffffffd8
> 
> Thanks,
> Jun
> 
> On Wed, Jun 11, 2014 at 2:33 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Tue, 2014-06-10 at 19:40 -0700, Jun Wu wrote:
> >> On Tue, Jun 10, 2014 at 3:38 PM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> >> > On Tue, 2014-06-10 at 09:46 -0700, Jun Wu wrote:
> >> >> This is a Supermicro chassis with redundant power supplies. We see
> >> >> the same failures with both SSDs and HDDs.
> >> >> The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
> >> >>
> >> >
> >> > Were the iSCSI or AoE tests run with the same TCM core kernel and
> >> > the same target and host NICs/switch?
> >>
> >> We tested AoE with the same hardware/switch and test setup. AoE works,
> >> except that it is not an enterprise protocol and it doesn't perform
> >> well. It doesn't use TCM.
> >>
> >> >
> >> > What NICs are in your chassis? As I mentioned before, "DCB and PFC
> >> > PAUSE typically used and required by fcoe", but you are using PAUSE,
> >> > and the switch cannot be eliminated, as you mentioned before. These
> >> > could affect FCoE more than other protocols, so can you ensure the
> >> > IO errors are not due to frame losses w/o DCB/PFC in your setup?
> >>
> >> The NIC is:
> >> [root@poc1 log]# lspci | grep 82599
> >> 08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> >> SFI/SFP+ Network Connection (rev 01)
> >>
> >> The issue should not be caused by frame losses. The systems work fine
> >> with other protocols.
> >>
> >> >
> >> > There are possibly abort issues at the target with zero timeout
> >> > values, but you could avoid them completely by increasing the scsi
> >> > timeout and disabling REC, as discussed before.
> >> >
> >> > Please use inline response and avoid top posts.
> >> >
> >> > Thanks,
> >> > Vasu
> >> >
> >>
> >> Is the following cmd_per_lun fcoe related? Its default value is 3, and
> >> it doesn't allow me to change it.
> >> /sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
> >>
> >> Nab,
> >> Could you please send the patches you mentioned for me to test?
> >
> > The two are in for-next here:
> >
> > tcm_fc: Generate TASK_SET_FULL status for DataIN failures
> > https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=b3e5fe1688b998ba5287a68667ef7cc568739e44
> >
> > tcm_fc: Generate TASK_SET_FULL status for response failures
> > https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=6dbe7f4e97d55eefcb471c41c16b62fca5f10c68
> >
> > --nab
> >
