System crashes with increased drive count

Jun Wu <jwu@xxxxxxxxxxxx> · Thu, 8 May 2014 19:17:35 -0700

We are running into system crashes as the number of drives under test
increases. The test configuration is one initiator server running
fio sessions against remote drives on a target server via FCoE VN2VN.
Both servers are running Fedora 20 (kernel 3.14.2-200). Running fio
sessions against up to 7 remote drives works, but the target machine
hangs when the drive count is increased to 8. The crashes are highly
repeatable and were also reproduced on RHEL 6.5. The following error
messages appear on the target server:

[ 1503.737314] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000048
[ 1503.737442] IP: [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc]
[ 1503.737540] PGD 0
[ 1503.737575] Oops: 0000 [#1] SMP
[ 1503.737631] Modules linked in: tcm_fc target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp mrp fuse
ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw coretemp iTCO_wdt kvm_intel kvm gpio_ich
iTCO_vendor_support ses crc32c_intel tpm_tis enclosure i7core_edac
ioatdma edac_core shpchp serio_raw tpm lpc_ich mfd_core i2c_i801
microcode acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm
[ 1503.738933]  igb ixgbe drm ata_generic mdio ptp pata_acpi pps_core
pata_jmicron i2c_algo_bit aacraid dca i2c_core
[ 1503.739118] CPU: 5 PID: 6537 Comm: kworker/5:4 Not tainted
3.14.2-200.fc20.x86_64 #1
[ 1503.739225] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c       10/28/2011
[ 1503.739338] Workqueue: target_completion target_complete_ok_work
[target_core_mod]
[ 1503.739449] task: ffff88062071d580 ti: ffff88061a322000 task.ti:
ffff88061a322000
[ 1503.739553] RIP: 0010:[<ffffffffa0610885>]  [<ffffffffa0610885>]
ft_sess_put+0x5/0x30 [tcm_fc]
[ 1503.739681] RSP: 0018:ffff88061a323ce8  EFLAGS: 00010016
[ 1503.739755] RAX: 0000000000000000 RBX: ffff880304a23498 RCX: 0000000000009010
[ 1503.739853] RDX: 0000000000009010 RSI: 00000000000000cb RDI: 0000000000000000
[ 1503.739953] RBP: ffff88061a323d08 R08: ffff88031c4c6500 R09: 000000018020000f
[ 1503.740051] R10: ffffffff815cfe87 R11: ffffea000c713180 R12: ffff88031c4c6500
[ 1503.740150] R13: ffff88031f7c1f80 R14: ffff88031f7c1fe8 R15: 0000000000000000
[ 1503.740250] FS:  0000000000000000(0000) GS:ffff88063fc20000(0000)
knlGS:0000000000000000
[ 1503.740363] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1503.740443] CR2: 0000000000000048 CR3: 0000000001c0c000 CR4: 00000000000007e0
[ 1503.740541] Stack:
[ 1503.740572]  ffffffffa060e058 ffff880304a23568 ffff88031f7c1f80
ffff880304a234a8
[ 1503.740692]  ffff88061a323d18 ffffffffa060e5e2 ffff88061a323d40
ffffffffa05cff42
[ 1503.740812]  ffff880304a234a8 ffff880304a23568 0000000000000246
ffff88061a323d70
[ 1503.740931] Call Trace:
[ 1503.740969]  [<ffffffffa060e058>] ? ft_free_cmd+0x58/0x60 [tcm_fc]
[ 1503.741057]  [<ffffffffa060e5e2>] ft_release_cmd+0x12/0x20 [tcm_fc]
[ 1503.741150]  [<ffffffffa05cff42>] target_release_cmd_kref+0x52/0x80
[target_core_mod]
[ 1503.741264]  [<ffffffffa05d1bd3>] transport_release_cmd+0xd3/0xf0
[target_core_mod]
[ 1503.741377]  [<ffffffffa05d1c28>]
transport_generic_free_cmd+0x38/0x250 [target_core_mod]
[ 1503.741491]  [<ffffffffa060e600>] ft_check_stop_free+0x10/0x20 [tcm_fc]
[ 1503.741590]  [<ffffffffa05cfe32>]
transport_cmd_check_stop+0xc2/0x140 [target_core_mod]
[ 1503.741708]  [<ffffffffa05d3a97>]
target_complete_ok_work+0xe7/0x2d0 [target_core_mod]
[ 1503.741824]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 1503.741907]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 1503.741985]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 1503.742069]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 1503.742138]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 1503.742227]  [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0
[ 1503.746668]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 1503.751104] Code: 48 89 f0 48 89 c7 89 d6 48 89 e5 48 8b 49 10 48
89 ca e8 4f ed ff ff 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <8b> 47 48 85 c0 74 22 48 8d 47 48 f0 83 6f 48 01 74 09 c3
0f 1f
[ 1503.760558] RIP  [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc]
[ 1503.765145]  RSP <ffff88061a323ce8>
[ 1503.769687] CR2: 0000000000000048
[ 1503.789003] ---[ end trace c7457ccb45bf0bc9 ]---
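
As a reading aid for the oops above, the faulting instruction can be
decoded from the "Code:" line and checked against the register dump.
The sketch below handles only the one instruction encoding seen here
(a 32-bit load from [base + 8-bit displacement]); the names
decode_faulting_load, REGS, and CODE are made up for illustration. The
reading that ft_sess_put() was entered with a NULL pointer in RDI is an
inference from these values, not something confirmed against the
tcm_fc source.

```python
# Register value copied from the oops above (only RDI is needed here).
REGS = {"rdi": 0x0000000000000000}

# Tail of the "Code:" line; the byte in <> is where the CPU faulted.
CODE = "66 66 66 66 90 <8b> 47 48 85 c0 74 22 48 8d 47 48 f0 83 6f 48 01"

# 64-bit base registers indexed by the ModRM r/m field (no REX prefix).
BASE_REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_faulting_load(code, regs):
    """Decode a `mov r32, [base + disp8]` at the marked byte.

    Returns (base register name, displacement, effective address).
    Only this one encoding is handled; anything else raises ValueError.
    """
    tokens = code.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    opcode = int(tokens[start].strip("<>"), 16)
    modrm = int(tokens[start + 1], 16)
    mod, rm = modrm >> 6, modrm & 0b111
    if opcode != 0x8B or mod != 0b01 or rm == 0b100:
        raise ValueError("not a simple mov r32, [reg + disp8]")
    disp = int(tokens[start + 2], 16)
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    base = BASE_REGS[rm]
    return base, disp, (regs[base] + disp) & 0xFFFFFFFFFFFFFFFF


base, disp, addr = decode_faulting_load(CODE, REGS)
print(f"faulting load: [{base} + {disp:#x}] -> address {addr:#018x}")
```

With RDI = 0, the computed address equals the CR2 value in the oops
(0x48), which would be consistent with a field at offset 0x48 being
read through a NULL pointer.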

Before the target hangs, many messages like the following are printed
on the initiator:

fio: io_u error on file /dev/sdl: Input/output error
     read offset=1030152192, buflen=4096

[ 3787.971900] sd 8:0:0:0: [sdl] Unhandled error code
[ 3787.971907] sd 8:0:0:0: [sdl]
[ 3787.971910] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[ 3787.971913] sd 8:0:0:0: [sdl] CDB:
[ 3787.971915] Read(10): 28 00 00 1e b3 70 00 00 08 00
[ 3787.971924] end_request: I/O error, dev sdl, sector 2012016
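
To relate the SCSI error to the fio error above, the Read(10) CDB can
be decoded by hand or with a small sketch like the one below. It
assumes 512-byte logical blocks (an assumption; the block size is not
stated in the logs), and decode_read10 is a name made up for
illustration.

```python
def decode_read10(cdb_hex, block_size=512):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks, byte_offset, byte_length). block_size is an
    assumed logical block size, not a value taken from the logs.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks, lba * block_size, blocks * block_size


lba, blocks, offset, length = decode_read10("28 00 00 1e b3 70 00 00 08 00")
print(f"LBA={lba} blocks={blocks} offset={offset} length={length}")
```

Under that assumption, the CDB decodes to LBA 2012016 and 8 blocks,
i.e. byte offset 1030152192 and length 4096, which matches both the
end_request sector and the fio offset/buflen reported for /dev/sdl.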

Installation steps used:
yum install lldpad
yum install fcoe-utils
modprobe fcoe
yum install targetcli

Before these tests, we also installed RHEL 6.5 and followed the
instructions at https://www.open-fcoe.org/. On RHEL, I was only
able to run fio against 3 target drives. Using 4 target drives crashed
the target machine.

Any suggestions?

Thanks,

Jun