Hi Nicholas, We had to roll back system from 3.14 to 3.11 due to compile issues of our software. So I am not able to verify your fix at this point. I ran the same tests on 3.11 instead. In one case the target crashed with following message: May 13 13:06:25 poc2 kernel: BUG: unable to handle kernel paging request at ffffffffffffffa4 May 13 13:06:25 poc2 kernel: IP: [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40 May 13 13:06:25 poc2 kernel: PGD 1c0f067 PUD 1c11067 PMD 0 May 13 13:06:25 poc2 kernel: Oops: 0002 [#1] SMP May 13 13:06:25 poc2 kernel: Modules linked in: fcoe libfcoe 8021q garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw nfsd auth_rpcgss nfs_acl lockd sunrpc ixgbe mdio igb ptp pps_core serio_raw ses enclosure iTCO_wdt iTCO_vendor_support lpc_ich mfd_core shpchp i2c_i801 coretemp kvm_intel kvm crc32c_intel microcode i7core_edac ioatdma acpi_cpufreq edac_core dca mperf radeon i2c_algo_bit May 13 13:06:25 poc2 kernel: drm_kms_helper ttm drm ata_generic i2c_core pata_acpi pata_jmicron aacraid May 13 13:06:25 poc2 kernel: CPU: 0 PID: 1810 Comm: kworker/0:0 Not tainted 3.11.10-301.fc20.x86_64 #1 May 13 13:06:25 poc2 kernel: Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011 May 13 13:06:25 poc2 kernel: Workqueue: target_completion target_complete_ok_work [target_core_mod] May 13 13:06:25 poc2 kernel: task: ffff88032c5096e0 ti: ffff88031bb78000 task.ti: ffff88031bb78000 May 13 13:06:25 poc2 kernel: RIP: 0010:[<ffffffff8164ac07>] [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40 May 13 13:06:25 poc2 kernel: RSP: 0018:ffff88031bb79cf0 EFLAGS: 00010206 May 13 13:06:25 poc2 kernel: RAX: 0000000000000100 RBX: ffffffffffffffa4 RCX: 0000000000000000 May 13 13:06:25 poc2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffa4 May 13 13:06:25 poc2 kernel: RBP: ffff88031bb79cf8 R08: 00000000ffffffff R09: ffff88031a37f678 May 13 13:06:25 poc2 kernel: R10: 0000000000000001 R11: 0000000000000044 R12: 0000000000000000 May 13 13:06:25 poc2 kernel: R13: ffff88031a37f678 R14: ffff88062d9fd6c8 R15: ffff88032c6da05c May 13 13:06:25 poc2 kernel: FS: 0000000000000000(0000) GS:ffff880333c00000(0000) knlGS:0000000000000000 May 13 13:06:25 poc2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b May 13 13:06:25 poc2 kernel: CR2: ffffffffffffffa4 CR3: 0000000001c0c000 CR4: 00000000000007f0 May 13 13:06:25 poc2 kernel: Stack: May 13 13:06:25 poc2 kernel: ffffffffffffffa4 ffff88031bb79d18 ffffffffa0594d2b ffff880328d13410 May 13 13:06:25 poc2 kernel: ffff88031a37c200 ffff88031bb79d58 ffffffffa05356f2 0000000000000018 May 13 13:06:25 poc2 kernel: ffff88062cea8800 0000000000000000 ffffea000c8eb640 0000000000000000 May 13 13:06:25 poc2 kernel: Call Trace: May 13 13:06:25 poc2 kernel: [<ffffffffa0594d2b>] fc_seq_start_next+0x1b/0x40 [libfc] May 13 13:06:25 poc2 kernel: [<ffffffffa05356f2>] ft_queue_status+0xf2/0x220 [tcm_fc] May 13 13:06:25 poc2 kernel: [<ffffffffa0536972>] ft_queue_data_in+0x72/0x5a0 [tcm_fc] May 13 13:06:25 poc2 kernel: [<ffffffffa04f57ba>] target_complete_ok_work+0x14a/0x2b0 [target_core_mod] May 13 13:06:25 poc2 kernel: [<ffffffff810810f5>] process_one_work+0x175/0x430 May 13 13:06:25 poc2 kernel: [<ffffffff81081d1b>] worker_thread+0x11b/0x3a0 May 13 13:06:25 poc2 kernel: [<ffffffff81081c00>] ? rescuer_thread+0x340/0x340 May 13 13:06:25 poc2 kernel: [<ffffffff81088660>] kthread+0xc0/0xd0 May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40 May 13 13:06:25 poc2 kernel: [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0 May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40 May 13 13:06:25 poc2 kernel: Code: 1f 44 00 00 f3 90 0f b6 07 38 d0 75 f7 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 7e 05 a2 ff b8 00 01 00 00 <f0> 66 0f c1 03 0f b6 d4 38 c2 74 0e 0f 1f 44 00 00 f3 90 0f b6 May 13 13:06:25 poc2 kernel: RIP [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40 In another case, the initiator crashed with: May 13 12:00:47 poc1 kernel: [ 4086.708455] WARNING: CPU: 1 PID: 1869 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0() May 13 12:00:47 poc1 kernel: [ 4086.708459] list_del corruption. next->prev should be ffff88061dab0318, but was ffff88061d257318 May 13 12:00:47 poc1 kernel: [ 4086.708461] Modules linked in: fcoe libfcoe 8021q garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw coretemp kvm_intel kvm crc32c_intel iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 ses igb enclosure lpc_ich mfd_core ixgbe ptp pps_core mdio i7core_edac ioatdma edac_core dca shpchp acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm ata_generic i2c_core pata_acpi pata_jmicron aacraid May 13 12:00:47 poc1 kernel: [ 4086.708556] CPU: 1 PID: 1869 Comm: fcoethread/1 Not tainted 3.11.10-301.fc20.x86_64 #1 May 13 12:00:47 poc1 kernel: [ 4086.708558] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011 May 13 12:00:47 poc1 kernel: [ 4086.708561] 0000000000000009 ffff8806129dfb40 ffffffff816441db ffff8806129dfb88 May 13 12:00:47 poc1 kernel: [ 4086.708569] ffff8806129dfb78 ffffffff8106715d ffff88061dab0318 ffff88061dab0a00 May 13 12:00:47 poc1 kernel: [ 4086.708576] 0000000000000286 ffff880c1b5e4388 0000000000000030 ffff8806129dfbd8 May 13 12:00:47 poc1 kernel: [ 4086.708582] Call Trace: May 13 12:00:47 poc1 kernel: [ 4086.708592] [<ffffffff816441db>] dump_stack+0x45/0x56 May 13 12:00:47 poc1 kernel: [ 4086.708598] [<ffffffff8106715d>] warn_slowpath_common+0x7d/0xa0 May 13 12:00:47 poc1 kernel: [ 4086.708602] [<ffffffff810671cc>] warn_slowpath_fmt+0x4c/0x50 May 13 12:00:47 poc1 kernel: [ 4086.708608] [<ffffffff81311dc2>] __list_del_entry+0x82/0xd0 May 13 12:00:47 poc1 kernel: [ 4086.708613] [<ffffffff81311e1d>] list_del+0xd/0x30 May 13 12:00:47 poc1 kernel: [ 4086.708624] [<ffffffffa05de23c>] fc_io_compl+0x1cc/0x710 [libfc] May 13 12:00:47 poc1 kernel: [ 4086.708633] [<ffffffffa05de7df>] fc_fcp_complete_locked+0x5f/0x1a0 [libfc] May 13 12:00:47 poc1 kernel: [ 4086.708642] [<ffffffffa05dfac9>] fc_fcp_resp.isra.22+0x79/0x2f0 [libfc] May 13 12:00:47 poc1 kernel: [ 4086.708651] [<ffffffff810a2a33>] ? load_balance+0xe3/0x740 May 13 12:00:47 poc1 kernel: [ 4086.708660] [<ffffffffa05e0424>] fc_fcp_recv+0x6e4/0xef0 [libfc] May 13 12:00:47 poc1 kernel: [ 4086.708666] [<ffffffff810115ce>] ? __switch_to+0x13e/0x4b0 May 13 12:00:47 poc1 kernel: [ 4086.708673] [<ffffffff8164aab5>] ? _raw_spin_unlock_bh+0x15/0x20 May 13 12:00:47 poc1 kernel: [ 4086.708682] [<ffffffffa05dfd40>] ? fc_fcp_resp.isra.22+0x2f0/0x2f0 [libfc] May 13 12:00:47 poc1 kernel: [ 4086.708690] [<ffffffffa05d421b>] fc_exch_recv+0x8eb/0xd70 [libfc] May 13 12:00:47 poc1 kernel: [ 4086.708695] [<ffffffffa0613299>] fcoe_percpu_receive_thread+0x299/0x540 [fcoe] May 13 12:00:47 poc1 kernel: [ 4086.708699] [<ffffffffa0613000>] ? fcoe_set_port_id+0x50/0x50 [fcoe] May 13 12:00:47 poc1 kernel: [ 4086.708705] [<ffffffff81088660>] kthread+0xc0/0xd0 May 13 12:00:47 poc1 kernel: [ 4086.708710] [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40 May 13 12:00:47 poc1 kernel: [ 4086.708717] [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0 May 13 12:00:47 poc1 kernel: [ 4086.708723] [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40 May 13 12:00:47 poc1 kernel: [ 4086.708728] ---[ end trace 61dc774d1f379191 ]--- [root@poc1 log]# lspci | grep 82599 08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) [root@poc1 log]# uname -a Linux poc1 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux Thanks, Jun On Mon, May 12, 2014 at 12:34 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote: > On Thu, 2014-05-08 at 19:17 -0700, Jun Wu wrote: >> We are running in system crashes as number of drive under test >> increases. The test configuration is one initiator as server running >> fio sessions to remote drives on a target server via fcoe vn2vn. Both >> servers running fedora 20 (kernel 3.14.2-200). Running fio sessions up >> to 7 remote drives works but target machines hangs when drive count >> increased to 8. The system crashes are very repeatable and duplicated >> on RHEL 6.5. Following are error messages on target server: >> >> >> [ 1503.737314] BUG: unable to handle kernel NULL pointer dereference >> at 0000000000000048 >> [ 1503.737442] IP: [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc] >> [ 1503.737540] PGD 0 >> [ 1503.737575] Oops: 0000 [#1] SMP >> [ 1503.737631] Modules linked in: tcm_fc target_core_pscsi >> target_core_file target_core_iblock iscsi_target_mod target_core_mod >> fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp mrp fuse >> ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute >> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 >> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security >> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 >> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle >> iptable_security iptable_raw coretemp iTCO_wdt kvm_intel kvm gpio_ich >> iTCO_vendor_support ses crc32c_intel tpm_tis enclosure i7core_edac >> ioatdma edac_core shpchp serio_raw tpm lpc_ich mfd_core i2c_i801 >> microcode acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon >> drm_kms_helper ttm >> [ 1503.738933] igb ixgbe drm ata_generic mdio ptp pata_acpi pps_core >> pata_jmicron i2c_algo_bit aacraid dca i2c_core >> [ 1503.739118] CPU: 5 PID: 6537 Comm: kworker/5:4 Not tainted >> 3.14.2-200.fc20.x86_64 #1 >> [ 1503.739225] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011 >> [ 1503.739338] Workqueue: target_completion target_complete_ok_work >> [target_core_mod] >> [ 1503.739449] task: ffff88062071d580 ti: ffff88061a322000 task.ti: >> ffff88061a322000 >> [ 1503.739553] RIP: 0010:[<ffffffffa0610885>] [<ffffffffa0610885>] >> ft_sess_put+0x5/0x30 [tcm_fc] >> [ 1503.739681] RSP: 0018:ffff88061a323ce8 EFLAGS: 00010016 >> [ 1503.739755] RAX: 0000000000000000 RBX: ffff880304a23498 RCX: 0000000000009010 >> [ 1503.739853] RDX: 0000000000009010 RSI: 00000000000000cb RDI: 0000000000000000 >> [ 1503.739953] RBP: ffff88061a323d08 R08: ffff88031c4c6500 R09: 000000018020000f >> [ 1503.740051] R10: ffffffff815cfe87 R11: ffffea000c713180 R12: ffff88031c4c6500 >> [ 1503.740150] R13: ffff88031f7c1f80 R14: ffff88031f7c1fe8 R15: 0000000000000000 >> [ 1503.740250] FS: 0000000000000000(0000) GS:ffff88063fc20000(0000) >> knlGS:0000000000000000 >> [ 1503.740363] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> [ 1503.740443] CR2: 0000000000000048 CR3: 0000000001c0c000 CR4: 00000000000007e0 >> [ 1503.740541] Stack: >> [ 1503.740572] ffffffffa060e058 ffff880304a23568 ffff88031f7c1f80 >> ffff880304a234a8 >> [ 1503.740692] ffff88061a323d18 ffffffffa060e5e2 ffff88061a323d40 >> ffffffffa05cff42 >> [ 1503.740812] ffff880304a234a8 ffff880304a23568 0000000000000246 >> ffff88061a323d70 >> [ 1503.740931] Call Trace: >> [ 1503.740969] [<ffffffffa060e058>] ? ft_free_cmd+0x58/0x60 [tcm_fc] >> [ 1503.741057] [<ffffffffa060e5e2>] ft_release_cmd+0x12/0x20 [tcm_fc] >> [ 1503.741150] [<ffffffffa05cff42>] target_release_cmd_kref+0x52/0x80 >> [target_core_mod] >> [ 1503.741264] [<ffffffffa05d1bd3>] transport_release_cmd+0xd3/0xf0 >> [target_core_mod] >> [ 1503.741377] [<ffffffffa05d1c28>] >> transport_generic_free_cmd+0x38/0x250 [target_core_mod] >> [ 1503.741491] [<ffffffffa060e600>] ft_check_stop_free+0x10/0x20 [tcm_fc] >> [ 1503.741590] [<ffffffffa05cfe32>] >> transport_cmd_check_stop+0xc2/0x140 [target_core_mod] >> [ 1503.741708] [<ffffffffa05d3a97>] >> target_complete_ok_work+0xe7/0x2d0 [target_core_mod] >> [ 1503.741824] [<ffffffff810a6886>] process_one_work+0x176/0x430 >> [ 1503.741907] [<ffffffff810a74db>] worker_thread+0x11b/0x3a0 >> [ 1503.741985] [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370 >> [ 1503.742069] [<ffffffff810ae211>] kthread+0xe1/0x100 >> [ 1503.742138] [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 >> [ 1503.742227] [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0 >> [ 1503.746668] [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 >> [ 1503.751104] Code: 48 89 f0 48 89 c7 89 d6 48 89 e5 48 8b 49 10 48 >> 89 ca e8 4f ed ff ff 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 >> 66 66 66 90 <8b> 47 48 85 c0 74 22 48 8d 47 48 f0 83 6f 48 01 74 09 c3 >> 0f 1f >> [ 1503.760558] RIP [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc] >> [ 1503.765145] RSP <ffff88061a323ce8> >> [ 1503.769687] CR2: 0000000000000048 >> [ 1503.789003] ---[ end trace c7457ccb45bf0bc9 ]--- >> >> >> > > The v3.14 OOPs above looks like a free-after-use regression from the > v3.13 conversion to use percpu-ida for pre-allocation of ft_cmd > descriptors. > > Here's the patch that I'm applying to address this specific bug in > tcm_fc. Please apply it and verify the fix on your end. > > From 1d8dc8a29cfa6d66e5068ab6dad3216fe218cc53 Mon Sep 17 00:00:00 2001 > From: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx> > Date: Mon, 12 May 2014 12:18:32 -0700 > Subject: [PATCH] tcm_fc: Fix free-after-use regression in ft_free_cmd > > This patch fixes a free-after-use regression in ft_free_cmd(), where > percpu_ida_free() was incorrectly called to release the tag before > ft_sess_put() is called to drop the session reference. > > Fix this bug by moving the percpu_ida_free() call after ft_free_cmd(). > > The regression was originally introduced in v3.13-rc1 commit: > > commit 5f544cfac956971099e906f94568bc3fd1a7108a > Author: Nicholas Bellinger <nab@xxxxxxxxxxxxx> > Date: Mon Sep 23 12:12:42 2013 -0700 > > tcm_fc: Convert to per-cpu command map pre-allocation of ft_cmd > > Reported-by: Jun Wu <jwu@xxxxxxxxxxxx> > Cc: Mark Rustad <mark.d.rustad@xxxxxxxxx> > Cc: Robert Love <robert.w.love@xxxxxxxxx> > Cc: <stable@xxxxxxxxxxxxxxx> #3.13+ > Signed-off-by: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx> > --- > drivers/target/tcm_fc/tfc_cmd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c > index 01cf37f..28fce39 100644 > --- a/drivers/target/tcm_fc/tfc_cmd.c > +++ b/drivers/target/tcm_fc/tfc_cmd.c > @@ -100,8 +100,8 @@ static void ft_free_cmd(struct ft_cmd *cmd) > if (fr_seq(fp)) > lport->tt.seq_release(fr_seq(fp)); > fc_frame_free(fp); > - percpu_ida_free(&se_sess->sess_tag_pool, cmd->se_cmd.map_tag); > ft_sess_put(cmd->sess); /* undo get from lookup at recv */ > + percpu_ida_free(&se_sess->sess_tag_pool, cmd->se_cmd.map_tag); > } > > void ft_release_cmd(struct se_cmd *se_cmd) > -- > 1.7.10.4 > >> Before target hangs, a lot of messages as follows are printed out on >> the initiator: >> >> fio: io_u error on file /dev/sdl: Input/output error >> read offset=1030152192, buflen=4096 >> >> [ 3787.971900] sd 8:0:0:0: [sdl] Unhandled error code >> [ 3787.971907] sd 8:0:0:0: [sdl] >> [ 3787.971910] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK >> [ 3787.971913] sd 8:0:0:0: [sdl] CDB: >> [ 3787.971915] Read(10): 28 00 00 1e b3 70 00 00 08 00 >> [ 3787.971924] end_request: I/O error, dev sdl, sector 2012016 >> >> > > Not sure what's going on here without more information. > >> Installation steps used: >> yum install lldpad >> yum install fcoe-utils >> modprobe fcoe >> yum install targetcli >> >> Before these tests, we also installed Redhat 6.5 and followed >> instructions on https://www.open-fcoe.org/. On Redhat, I were only >> able to run fio to 3 target drives. Using 4 target drives crashed the >> target machine. >> > > No idea without more info wrt RHEL 6.5, but it certainly doesn't have > the v3.13+ specific percpu-ida regression from above. > > --nab > -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html