Re: [Open-FCoE] System crashes with increased drive count

On Tue, 2014-05-20 at 22:29 -0700, Jun Wu wrote:
> MTU was 1500 on both initiator and target.
> I used "ethtool -K p4p1 tso off" to turn off tcp segmentation offload
> on all machines. Register setting after the command is shown below.
> 
> [root@poc3 jkong]# ethtool -k p4p1
> Features for p4p1:
> rx-checksumming: on
> tx-checksumming: on
>         tx-checksum-ipv4: on
>         tx-checksum-ip-generic: off [fixed]
>         tx-checksum-ipv6: on
>         tx-checksum-fcoe-crc: on [fixed]
>         tx-checksum-sctp: on
> scatter-gather: on
>         tx-scatter-gather: on
>         tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: off
>         tx-tcp-segmentation: off
>         tx-tcp-ecn-segmentation: off [fixed]
>         tx-tcp6-segmentation: off
> udp-fragmentation-offload: off [fixed]
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off
> receive-hashing: on
> highdma: on [fixed]
> rx-vlan-filter: on
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: on [fixed]
> tx-gre-segmentation: off [fixed]
> tx-ipip-segmentation: off [fixed]
> tx-sit-segmentation: off [fixed]
> tx-udp_tnl-segmentation: off [fixed]
> tx-mpls-segmentation: off [fixed]
> fcoe-mtu: on [fixed]
> tx-nocache-copy: on
> loopback: off [fixed]
> rx-fcs: off [fixed]
> rx-all: off
> tx-vlan-stag-hw-insert: off [fixed]
> rx-vlan-stag-hw-parse: off [fixed]
> rx-vlan-stag-filter: off [fixed]
> l2-fwd-offload: off
> 
> Info on NIC drivers
> 
> [root@poc3 jkong]# ethtool -i p4p1
> driver: ixgbe
> version: 3.15.1-k
> firmware-version: 0x80000208
> bus-info: 0000:08:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
> 
> After the change, I repeated the same test and got a similar failure on
> the target side:
> 
> [12253.032595] ft_queue_data_in: Failed to send frame
> ffff88062a638600, xid <0xa0c>, remaining 458752, lso_max <0x10000>

This is a frame send failure. To find out what caused it, more debug
output from the low-level fcoe Tx path functions would be helpful. It can
be enabled with:

# echo 0xFF > /sys/module/libfc/parameters/debug_logging
# echo 0x1 > /sys/module/fcoe/parameters/debug_logging
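
If it helps, here is a minimal sketch of applying the two settings above
while remembering the previous values, so they can be put back once the
failure has been reproduced. The set_debug helper and the SYSFS_ROOT
variable are hypothetical names made up for this sketch; only the two
sysfs paths come from the commands above.

```shell
#!/bin/sh
# SYSFS_ROOT is an assumption of this sketch: it lets the helper be
# pointed at a scratch directory for a dry run. Default is the real sysfs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# set_debug MODULE MASK
# Prints the module's current debug_logging value, then writes MASK.
set_debug() {
    param="$SYSFS_ROOT/module/$1/parameters/debug_logging"
    cat "$param" || return 1
    echo "$2" > "$param"
}

# Possible usage (as root), left commented out on purpose:
#   old_libfc=$(set_debug libfc 0xFF)
#   old_fcoe=$(set_debug fcoe 0x1)
#   ... rerun the fio test and collect dmesg ...
#   set_debug libfc "$old_libfc" > /dev/null
#   set_debug fcoe "$old_fcoe" > /dev/null
```

Restoring the old masks afterwards keeps the kernel log from filling up
with debug output once the trace has been captured.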

Disabling Tx offload is unlikely to help here and would only slow down Tx,
so please restore those settings.
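
For reference, a small sketch of undoing the "ethtool -K p4p1 tso off"
change quoted above and confirming the result. The restore_tso function
and the ETHTOOL override variable are hypothetical names used only in
this sketch; p4p1 is the interface name from the quoted output and may
differ on other machines.

```shell
#!/bin/sh
# ETHTOOL can be overridden for a dry run; this is an assumption of the
# sketch, not an ethtool feature.
ETHTOOL="${ETHTOOL:-ethtool}"

# restore_tso IFACE
# Turns TCP segmentation offload back on, then prints the summary line
# from "ethtool -k" so the new state can be checked by eye.
restore_tso() {
    "$ETHTOOL" -K "$1" tso on &&
    "$ETHTOOL" -k "$1" | grep '^tcp-segmentation-offload:'
}

# Possible usage (as root), on every machine where tso was turned off:
#   restore_tso p4p1
```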

Also, are you using a switch between the hosts and the target? In any case
you would need DCB PFC or PAUSE enabled to avoid excessive Tx retries,
though the lack of it should not cause a send failure.
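
One way to check the PAUSE side of this is sketched below, assuming
plain link-level flow control as reported by "ethtool -a"; it does not
cover DCB PFC, which on hosts running lldpad is normally queried with
dcbtool instead. The pause_status and enable_pause functions and the
ETHTOOL override variable are hypothetical names for this sketch.

```shell
#!/bin/sh
# ETHTOOL can be overridden for a dry run (assumption of this sketch).
ETHTOOL="${ETHTOOL:-ethtool}"

# pause_status IFACE
# Condenses "ethtool -a" output to one line: rx=<on|off> tx=<on|off>
pause_status() {
    "$ETHTOOL" -a "$1" | awk '
        $1 == "RX:" { rx = $2 }
        $1 == "TX:" { tx = $2 }
        END { printf "rx=%s tx=%s\n", rx, tx }'
}

# enable_pause IFACE
# Requests link-level PAUSE in both directions. Whether the NIC and the
# link partner accept it depends on the driver and the switch.
enable_pause() {
    "$ETHTOOL" -A "$1" rx on tx on
}

# Possible usage (as root) on initiator and target:
#   pause_status p4p1
#   enable_pause p4p1
```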


//Vasu

 
> [12253.032605] ft_queue_data_in: Failed to send frame
> ffff88062a638600, xid <0xa0c>, remaining 393216, lso_max <0x10000>
> [12253.032609] ft_queue_data_in: Failed to send frame
> ffff88062a638600, xid <0xa0c>, remaining 327680, lso_max <0x10000>
> [12253.032613] ft_queue_data_in: Failed to send frame
> ffff88062a638600, xid <0xa0c>, remaining 262144, lso_max <0x10000>
> [12284.299877] ft_queue_data_in: Failed to send frame
> ffff8803202ec600, xid <0x3a2>, remaining 196608, lso_max <0x10000>
> [12284.299885] ft_queue_data_in: Failed to send frame
> ffff8803202ec600, xid <0x3a2>, remaining 131072, lso_max <0x10000>
> [12284.299889] ft_queue_data_in: Failed to send frame
> ffff8803202ec600, xid <0x3a2>, remaining 65536, lso_max <0x10000>
> [12284.299892] ft_queue_data_in: Failed to send frame
> ffff8803202ec600, xid <0x3a2>, remaining 0, lso_max <0x10000>
> [12284.451810] ft_queue_data_in: Failed to send frame
> ffff88061deb1400, xid <0xecf>, remaining 458752, lso_max <0x10000>
> [12284.451818] ft_queue_data_in: Failed to send frame
> ffff88061deb1400, xid <0xecf>, remaining 393216, lso_max <0x10000>
> [12284.451824] ft_queue_data_in: Failed to send frame
> ffff88061deb1400, xid <0xecf>, remaining 327680, lso_max <0x10000>
> [12284.451827] ft_queue_data_in: Failed to send frame
> ffff88061deb1400, xid <0xecf>, remaining 262144, lso_max <0x10000>
> [12284.451831] ft_queue_data_in: Failed to send frame
> ffff88061deb1400, xid <0xecf>, remaining 196608, lso_max <0x10000>
> [12284.451834] ft_queue_data_in: Failed to send frame
> ffff88061deb1400, xid <0xecf>, remaining 131072, lso_max <0x10000>
> [12347.503478] ft_queue_data_in: 2 callbacks suppressed
> [12347.503486] ft_queue_data_in: Failed to send frame
> ffff8806142bc800, xid <0xb4f>, remaining 458752, lso_max <0x10000>
> [12347.503492] ft_queue_data_in: Failed to send frame
> ffff8806142bc800, xid <0xb4f>, remaining 393216, lso_max <0x10000>
> [12347.503496] ft_queue_data_in: Failed to send frame
> ffff8806142bc800, xid <0xb4f>, remaining 327680, lso_max <0x10000>
> [12347.503517] ft_queue_data_in: Failed to send frame
> ffff8806142bc800, xid <0xb4f>, remaining 262144, lso_max <0x10000>
> [12378.402412] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 458752, lso_max <0x10000>
> [12378.402420] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 393216, lso_max <0x10000>
> [12378.402425] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 327680, lso_max <0x10000>
> [12378.402428] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 262144, lso_max <0x10000>
> [12378.402432] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 196608, lso_max <0x10000>
> [12378.402436] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 131072, lso_max <0x10000>
> [12378.402440] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 65536, lso_max <0x10000>
> [12378.402444] ft_queue_data_in: Failed to send frame
> ffff88062ddeac00, xid <0x6a5>, remaining 0, lso_max <0x10000>
> [13049.224513] ft_queue_data_in: Failed to send frame
> ffff880614588c00, xid <0xd2f>, remaining 196608, lso_max <0x10000>
> [13049.224524] ft_queue_data_in: Failed to send frame
> ffff880614588c00, xid <0xd2f>, remaining 131072, lso_max <0x10000>
> [13049.224528] ft_queue_data_in: Failed to send frame
> ffff880614588c00, xid <0xd2f>, remaining 65536, lso_max <0x10000>
> [13049.224532] ft_queue_data_in: Failed to send frame
> ffff880614588c00, xid <0xd2f>, remaining 0, lso_max <0x10000>
> [13052.511306] ft_queue_data_in: Failed to send frame
> ffff88062d49f000, xid <0x8ae>, remaining 196608, lso_max <0x10000>
> [13052.511313] ft_queue_data_in: Failed to send frame
> ffff88062d49f000, xid <0x8ae>, remaining 131072, lso_max <0x10000>
> [13052.511317] ft_queue_data_in: Failed to send frame
> ffff88062d49f000, xid <0x8ae>, remaining 65536, lso_max <0x10000>
> [13052.511321] ft_queue_data_in: Failed to send frame
> ffff88062d49f000, xid <0x8ae>, remaining 0, lso_max <0x10000>
> [13087.976748] ft_queue_data_in: Failed to send frame
> ffff88031afc9c00, xid <0x96b>, remaining 458752, lso_max <0x10000>
> [13087.998453] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 458752, lso_max <0x10000>
> [13087.998459] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 393216, lso_max <0x10000>
> [13087.998463] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 327680, lso_max <0x10000>
> [13087.998467] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 262144, lso_max <0x10000>
> [13087.998470] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 196608, lso_max <0x10000>
> [13087.998474] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 131072, lso_max <0x10000>
> [13087.998478] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 65536, lso_max <0x10000>
> [13087.998482] ft_queue_data_in: Failed to send frame
> ffff88032c881200, xid <0xb23>, remaining 0, lso_max <0x10000>
> [13119.177286] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 458752, lso_max <0x10000>
> [13119.177297] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 393216, lso_max <0x10000>
> [13119.177302] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 327680, lso_max <0x10000>
> [13119.177307] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 262144, lso_max <0x10000>
> [13119.177311] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 196608, lso_max <0x10000>
> [13119.177316] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 131072, lso_max <0x10000>
> [13119.177321] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 65536, lso_max <0x10000>
> [13119.177325] ft_queue_data_in: Failed to send frame
> ffff88062dff7400, xid <0xfcf>, remaining 0, lso_max <0x10000>
> [13122.335322] ------------[ cut here ]------------
> [13122.335336] WARNING: CPU: 6 PID: 2165 at
> include/scsi/fc_frame.h:173 fcoe_percpu_receive_thread+0x507/0x53c
> [fcoe]()
> [13122.335338] Modules linked in: async_memcpy async_xor xor async_tx
> fcoe libfcoe tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
> target_core_file target_core_iblock iscsi_target_mod target_core_mod
> 8021q garp mrp bridge stp llc iTCO_wdt gpio_ich iTCO_vendor_support
> coretemp kvm_intel kvm crc32c_intel microcode serio_raw i2c_i801
> lpc_ich mfd_core ses enclosure i7core_edac ioatdma edac_core shpchp
> acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
> drm_kms_helper ttm drm ixgbe igb ata_generic mdio pata_acpi ptp
> pata_jmicron pps_core i2c_algo_bit aacraid dca i2c_core [last
> unloaded: vd]
> [13122.335390] CPU: 6 PID: 2165 Comm: fcoethread/6 Tainted: GF
>  O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
> [13122.335392] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c       10/28/2011
> [13122.335394]  0000000000000009 ffff88062b04bdd0 ffffffff81687eac
> 0000000000000000
> [13122.335400]  ffff88062b04be08 ffffffff8106d4dd ffffe8ffffc41748
> ffff88062a444700
> [13122.335404]  ffff8800b7e926e8 0000000000000002 ffff88062b04be88
> ffff88062b04be18
> [13122.335408] Call Trace:
> [13122.335419]  [<ffffffff81687eac>] dump_stack+0x45/0x56
> [13122.335426]  [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
> [13122.335430]  [<ffffffff8106d5ba>] warn_slowpath_null+0x1a/0x20
> [13122.335435]  [<ffffffffa0651517>]
> fcoe_percpu_receive_thread+0x507/0x53c [fcoe]
> [13122.335440]  [<ffffffffa0651010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
> [13122.335446]  [<ffffffff8108f2f2>] kthread+0xd2/0xf0
> [13122.335450]  [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
> [13122.335458]  [<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
> [13122.335461]  [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
> [13122.335464] ---[ end trace e4509e1053f499ac ]---
> 
> Thanks,
> 
> Jun
> 
> On Tue, May 20, 2014 at 11:03 AM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Mon, 2014-05-19 at 17:29 -0700, Jun Wu wrote:
> >> Hi Nicholas,
> >>
> >> We downloaded the source of our running kernel (3.13.10-200) and
> >> applied your percpu-ida pre-allocation regression fix, then compiled
> >> and installed the kernel. I repeated the same test three times,
> >> running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
> >> In the first two tests, the target machine hung with the following
> >> messages:
> >>
> >> 15231 May 19 11:49:27 poc1 kernel: [ 1073.783229] ft_queue_data_in:
> >> Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
> >> lso_max <0x10000>
> >> 15232 May 19 11:49:27 poc1 kernel: [ 1073.783238] ft_queue_data_in:
> >> Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
> >> lso_max <0x10000>
> >> 15233 May 19 11:49:27 poc1 kernel: [ 1073.783242] ft_queue_data_in:
> >> Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
> >> lso_max <0x10000>
> >> 15234 May 19 11:49:27 poc1 kernel: [ 1073.783245] ft_queue_data_in:
> >> Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
> >> lso_max <0x10000>
> >> 15235 May 19 11:49:30 poc1 kernel: [ 1076.907061] ft_queue_data_in:
> >> Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
> >> lso_max <0x10000>
> >> 15236 May 19 11:49:30 poc1 kernel: [ 1076.907068] ft_queue_data_in:
> >> Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
> >> lso_max <0x10000>
> >> 15237 May 19 11:49:30 poc1 kernel: [ 1076.907073] ft_queue_data_in:
> >> Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
> >> lso_max <0x10000>
> >> 15238 May 19 11:49:30 poc1 kernel: [ 1076.907077] ft_queue_data_in:
> >> Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
> >> lso_max <0x10000>
> >> 15239 May 19 11:50:01 poc1 kernel: [ 1107.918910] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
> >> lso_max <0x10000>
> >> 15240 May 19 11:50:01 poc1 kernel: [ 1107.918918] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
> >> lso_max <0x10000>
> >> 15241 May 19 11:50:01 poc1 kernel: [ 1107.918922] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
> >> lso_max <0x10000>
> >> 15242 May 19 11:50:01 poc1 kernel: [ 1107.918925] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
> >> lso_max <0x10000>
> >> 15243 May 19 11:50:01 poc1 kernel: [ 1107.918929] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
> >> lso_max <0x10000>
> >> 15244 May 19 11:50:01 poc1 kernel: [ 1107.918932] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
> >> lso_max <0x10000>
> >> 15245 May 19 11:50:01 poc1 kernel: [ 1107.918936] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
> >> lso_max <0x10000>
> >> 15246 May 19 11:50:01 poc1 kernel: [ 1107.918939] ft_queue_data_in:
> >> Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
> >> lso_max <0x10000>
> >> 15247 May 19 11:50:05 poc1 kernel: [ 1111.450900] ft_queue_data_in:
> >> Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
> >> lso_max <0x10000>
> >> 15248 May 19 11:50:05 poc1 kernel: [ 1111.450908] ft_queue_data_in:
> >> Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
> >> lso_max <0x10000>
> >> 15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
> >> callbacks suppressed
> >> 15250 May 19 11:51:12 poc1 kernel: [ 1178.698440] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
> >> lso_max <0x10000>
> >> 15251 May 19 11:51:12 poc1 kernel: [ 1178.698446] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
> >> lso_max <0x10000>
> >> 15252 May 19 11:51:12 poc1 kernel: [ 1178.698449] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
> >> lso_max <0x10000>
> >> 15253 May 19 11:51:12 poc1 kernel: [ 1178.698453] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
> >> lso_max <0x10000>
> >> 15254 May 19 11:51:12 poc1 kernel: [ 1178.698456] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
> >> lso_max <0x10000>
> >> 15255 May 19 11:51:12 poc1 kernel: [ 1178.698460] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
> >> lso_max <0x10000>
> >> 15256 May 19 11:51:12 poc1 kernel: [ 1178.698463] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
> >> lso_max <0x10000>
> >> 15257 May 19 11:51:12 poc1 kernel: [ 1178.698467] ft_queue_data_in:
> >> Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
> >> lso_max <0x10000>
> >>
> >
> > The call into lport->tt.seq_send() libfc code is failing to send
> > outgoing solicited data-in.  From the output, note the LSO (large
> > segment offload aka TCP segment offload) feature has been enabled by the
> > underlying NIC hardware.
> >
> > So in order to isolate possible issues, I'd recommend:
> >
> > - Disabling hardware offloads on both initiator and target sides (LRO +
> >   LSO) using ethtool -K
> > - Disabling any jumbo frames settings on either side
> >
> > Is there any other non standard network and/or switch settings that are
> > in place..?  Also, please confirm what your NIC + switch setup looks
> > like.
> >
> > Rob & Open-FCoE folks, is there anything else to take into consideration
> > here..?
> >
> >>
> >> I didn't see the previous message "unable to handle kernel NULL
> >> pointer dereference at 0000000000000048". So it must have been fixed
> >> by your change.
> >>
> >
> > Thanks for confirming that bit.
> >
> > --nab
> >
> _______________________________________________
> fcoe-devel mailing list
> fcoe-devel@xxxxxxxxxxxxx
> http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel

