On Thu, 2014-05-08 at 19:17 -0700, Jun Wu wrote:
> We are running into system crashes as the number of drives under test
> increases. The test configuration is one initiator as server running
> fio sessions to remote drives on a target server via fcoe vn2vn. Both
> servers are running fedora 20 (kernel 3.14.2-200). Running fio sessions
> up to 7 remote drives works but the target machine hangs when the drive
> count is increased to 8. The system crashes are very repeatable and
> duplicated on RHEL 6.5. Following are error messages on target server:
>
>
> [ 1503.737314] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000048
> [ 1503.737442] IP: [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc]
> [ 1503.737540] PGD 0
> [ 1503.737575] Oops: 0000 [#1] SMP
> [ 1503.737631] Modules linked in: tcm_fc target_core_pscsi
> target_core_file target_core_iblock iscsi_target_mod target_core_mod
> fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp mrp fuse
> ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw coretemp iTCO_wdt kvm_intel kvm gpio_ich
> iTCO_vendor_support ses crc32c_intel tpm_tis enclosure i7core_edac
> ioatdma edac_core shpchp serio_raw tpm lpc_ich mfd_core i2c_i801
> microcode acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
> drm_kms_helper ttm
> [ 1503.738933] igb ixgbe drm ata_generic mdio ptp pata_acpi pps_core
> pata_jmicron i2c_algo_bit aacraid dca i2c_core
> [ 1503.739118] CPU: 5 PID: 6537 Comm: kworker/5:4 Not tainted
> 3.14.2-200.fc20.x86_64 #1
> [ 1503.739225] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
> [ 1503.739338] Workqueue: target_completion target_complete_ok_work
> [target_core_mod]
> [ 1503.739449] task: ffff88062071d580 ti: ffff88061a322000 task.ti:
> ffff88061a322000
> [ 1503.739553] RIP: 0010:[<ffffffffa0610885>] [<ffffffffa0610885>]
> ft_sess_put+0x5/0x30 [tcm_fc]
> [ 1503.739681] RSP: 0018:ffff88061a323ce8 EFLAGS: 00010016
> [ 1503.739755] RAX: 0000000000000000 RBX: ffff880304a23498 RCX: 0000000000009010
> [ 1503.739853] RDX: 0000000000009010 RSI: 00000000000000cb RDI: 0000000000000000
> [ 1503.739953] RBP: ffff88061a323d08 R08: ffff88031c4c6500 R09: 000000018020000f
> [ 1503.740051] R10: ffffffff815cfe87 R11: ffffea000c713180 R12: ffff88031c4c6500
> [ 1503.740150] R13: ffff88031f7c1f80 R14: ffff88031f7c1fe8 R15: 0000000000000000
> [ 1503.740250] FS: 0000000000000000(0000) GS:ffff88063fc20000(0000)
> knlGS:0000000000000000
> [ 1503.740363] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1503.740443] CR2: 0000000000000048 CR3: 0000000001c0c000 CR4: 00000000000007e0
> [ 1503.740541] Stack:
> [ 1503.740572] ffffffffa060e058 ffff880304a23568 ffff88031f7c1f80
> ffff880304a234a8
> [ 1503.740692] ffff88061a323d18 ffffffffa060e5e2 ffff88061a323d40
> ffffffffa05cff42
> [ 1503.740812] ffff880304a234a8 ffff880304a23568 0000000000000246
> ffff88061a323d70
> [ 1503.740931] Call Trace:
> [ 1503.740969] [<ffffffffa060e058>] ? ft_free_cmd+0x58/0x60 [tcm_fc]
> [ 1503.741057] [<ffffffffa060e5e2>] ft_release_cmd+0x12/0x20 [tcm_fc]
> [ 1503.741150] [<ffffffffa05cff42>] target_release_cmd_kref+0x52/0x80
> [target_core_mod]
> [ 1503.741264] [<ffffffffa05d1bd3>] transport_release_cmd+0xd3/0xf0
> [target_core_mod]
> [ 1503.741377] [<ffffffffa05d1c28>]
> transport_generic_free_cmd+0x38/0x250 [target_core_mod]
> [ 1503.741491] [<ffffffffa060e600>] ft_check_stop_free+0x10/0x20 [tcm_fc]
> [ 1503.741590] [<ffffffffa05cfe32>]
> transport_cmd_check_stop+0xc2/0x140 [target_core_mod]
> [ 1503.741708] [<ffffffffa05d3a97>]
> target_complete_ok_work+0xe7/0x2d0 [target_core_mod]
> [ 1503.741824] [<ffffffff810a6886>] process_one_work+0x176/0x430
> [ 1503.741907] [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
> [ 1503.741985] [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
> [ 1503.742069] [<ffffffff810ae211>] kthread+0xe1/0x100
> [ 1503.742138] [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
> [ 1503.742227] [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0
> [ 1503.746668] [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
> [ 1503.751104] Code: 48 89 f0 48 89 c7 89 d6 48 89 e5 48 8b 49 10 48
> 89 ca e8 4f ed ff ff 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66
> 66 66 66 90 <8b> 47 48 85 c0 74 22 48 8d 47 48 f0 83 6f 48 01 74 09 c3
> 0f 1f
> [ 1503.760558] RIP [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc]
> [ 1503.765145] RSP <ffff88061a323ce8>
> [ 1503.769687] CR2: 0000000000000048
> [ 1503.789003] ---[ end trace c7457ccb45bf0bc9 ]---
>
>

The v3.14 oops above looks like a free-after-use regression from the
v3.13 conversion to use percpu-ida for pre-allocation of ft_cmd
descriptors.

Here's the patch that I'm applying to address this specific bug in
tcm_fc. Please apply it and verify the fix on your end.
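
For anyone following along, the ordering problem can be sketched in user space. This is only a minimal single-threaded model, not the tcm_fc code: every `demo_*` name is hypothetical, the one-slot pool stands in for the percpu-ida tag pool, and `demo_concurrent_reuse()` encodes my assumption that whoever takes the freed tag next zeroes the descriptor, which would be consistent with RDI=0 and CR2=0x48 in the oops.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the session and command descriptor. */
struct demo_sess { int refcount; };
struct demo_cmd  { struct demo_sess *sess; int tag; };

#define DEMO_POOL_SIZE 1
static struct demo_cmd demo_map[DEMO_POOL_SIZE]; /* preallocated descriptors */
static int demo_tag_busy[DEMO_POOL_SIZE];        /* 1 = tag is handed out */

void demo_reset(void)
{
	memset(demo_map, 0, sizeof(demo_map));
	memset(demo_tag_busy, 0, sizeof(demo_tag_busy));
}

int demo_tag_alloc(void)
{
	for (int i = 0; i < DEMO_POOL_SIZE; i++) {
		if (!demo_tag_busy[i]) {
			demo_tag_busy[i] = 1;
			return i;
		}
	}
	return -1; /* pool exhausted */
}

void demo_tag_free(int tag)
{
	assert(tag >= 0 && tag < DEMO_POOL_SIZE);
	demo_tag_busy[tag] = 0;
}

/* Take a descriptor and a session reference, as a receive path might. */
struct demo_cmd *demo_cmd_get(struct demo_sess *sess)
{
	int tag = demo_tag_alloc();
	if (tag < 0)
		return NULL;
	struct demo_cmd *cmd = &demo_map[tag];
	memset(cmd, 0, sizeof(*cmd));
	cmd->tag = tag;
	cmd->sess = sess;
	sess->refcount++;
	return cmd;
}

/* Models another CPU taking the just-freed tag and zeroing the slot. */
void demo_concurrent_reuse(void)
{
	int tag = demo_tag_alloc();
	if (tag >= 0)
		memset(&demo_map[tag], 0, sizeof(demo_map[tag]));
}

/* Buggy order: release the tag, then read cmd->sess from recycled memory.
 * Returns the pointer that would have been handed to the session put. */
struct demo_sess *demo_free_buggy(struct demo_cmd *cmd)
{
	demo_tag_free(cmd->tag);
	demo_concurrent_reuse();
	return cmd->sess; /* stale read: the slot is no longer ours */
}

/* Patched order: drop the session reference while the slot is still owned,
 * and only then release the tag. */
struct demo_sess *demo_free_fixed(struct demo_cmd *cmd)
{
	struct demo_sess *sess = cmd->sess;
	int tag = cmd->tag;

	sess->refcount--;
	demo_tag_free(tag);
	demo_concurrent_reuse();
	return sess;
}
```

In this model `demo_free_buggy()` ends up with a NULL session pointer and never drops the reference, while `demo_free_fixed()` drops it on the right session before the slot can be recycled. The real race depends on timing across CPUs, which this sketch replaces with an explicit call.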

From 1d8dc8a29cfa6d66e5068ab6dad3216fe218cc53 Mon Sep 17 00:00:00 2001
From: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
Date: Mon, 12 May 2014 12:18:32 -0700
Subject: [PATCH] tcm_fc: Fix free-after-use regression in ft_free_cmd

This patch fixes a free-after-use regression in ft_free_cmd(), where
percpu_ida_free() was incorrectly called to release the tag before
ft_sess_put() is called to drop the session reference.

Fix this bug by moving the percpu_ida_free() call after ft_sess_put().

The regression was originally introduced in v3.13-rc1 commit:

  commit 5f544cfac956971099e906f94568bc3fd1a7108a
  Author: Nicholas Bellinger <nab@xxxxxxxxxxxxx>
  Date:   Mon Sep 23 12:12:42 2013 -0700

      tcm_fc: Convert to per-cpu command map pre-allocation of ft_cmd

Reported-by: Jun Wu <jwu@xxxxxxxxxxxx>
Cc: Mark Rustad <mark.d.rustad@xxxxxxxxx>
Cc: Robert Love <robert.w.love@xxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> #3.13+
Signed-off-by: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
---
 drivers/target/tcm_fc/tfc_cmd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c
index 01cf37f..28fce39 100644
--- a/drivers/target/tcm_fc/tfc_cmd.c
+++ b/drivers/target/tcm_fc/tfc_cmd.c
@@ -100,8 +100,8 @@ static void ft_free_cmd(struct ft_cmd *cmd)
 	if (fr_seq(fp))
 		lport->tt.seq_release(fr_seq(fp));
 	fc_frame_free(fp);
-	percpu_ida_free(&se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
 	ft_sess_put(cmd->sess);	/* undo get from lookup at recv */
+	percpu_ida_free(&se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
 }
 
 void ft_release_cmd(struct se_cmd *se_cmd)
-- 
1.7.10.4

> Before the target hangs, a lot of messages as follows are printed out
> on the initiator:
>
> fio: io_u error on file /dev/sdl: Input/output error
> read offset=1030152192, buflen=4096
>
> [ 3787.971900] sd 8:0:0:0: [sdl] Unhandled error code
> [ 3787.971907] sd 8:0:0:0: [sdl]
> [ 3787.971910] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
> [ 3787.971913] sd 8:0:0:0: [sdl] CDB:
> [ 3787.971915] Read(10): 28 00 00 1e b3 70 00 00 08 00
> [ 3787.971924] end_request: I/O error, dev sdl, sector 2012016
>
>

Not sure what's going on here without more information.

> Installation steps used:
> yum install lldpad
> yum install fcoe-utils
> modprobe fcoe
> yum install targetcli
>
> Before these tests, we also installed Redhat 6.5 and followed
> instructions on https://www.open-fcoe.org/. On Redhat, I was only
> able to run fio to 3 target drives. Using 4 target drives crashed the
> target machine.
>

No idea without more info wrt RHEL 6.5, but it certainly doesn't have
the v3.13+ specific percpu-ida regression from above.

--nab