Hi Martin, On Sun, 2012-12-23 at 17:06 +0100, Martin Svec wrote: > Hello, > > my iscsi stress tests triggered a bug during target shutdown in latest stable > kernels (3.7.1 and earlier). Turning off an iscsi target with active connections > during heavy random write i/o sometimes leads to kernel oopses on different > places, hung kernel tasks and other symptoms of memory corruption. Thanks for the detailed bug report, and apologies for the holiday delay.. > This calltrace best shows what happens: > > BUG: unable to handle kernel NULL pointer dereference at (null) > Entering kdb (current=0xffff88022693d9d0, pid 15) on processor 1 Oops: (null) > due to oops @ 0xffffffff812d6bb6 > Pid: 15, comm: kworker/1:0 Not tainted 3.7.1-00004-ga57fbc7-dirty #23 Dell Inc. PowerEdge R510/0DPRKF > RIP: 0010:[<ffffffff812d6bb6>] [<ffffffff812d6bb6>] __list_add+0x26/0xd0 > Call Trace: > [<ffffffffa00952ba>] iscsit_add_cmd_to_response_queue+0x9a/0x100 [iscsi_target_mod] > [<ffffffffa00a20f6>] lio_queue_status+0x26/0x30 [iscsi_target_mod] > [<ffffffffa003d818>] target_complete_ok_work+0x198/0x200 [target_core_mod] > [<ffffffff81056913>] process_one_work+0x133/0x410 > [<ffffffffa003d680>] ? target_qf_do_work+0x2a0/0x2a0 [target_core_mod] > [<ffffffff81057800>] worker_thread+0x160/0x3c0 > [<ffffffff810576a0>] ? manage_workers+0x280/0x280 > [<ffffffff8105cfc6>] kthread+0xc6/0xd0 > [<ffffffff8105cf00>] ? flush_kthread_worker+0xb0/0xb0 > [<ffffffff8164c02c>] ret_from_fork+0x7c/0xb0 > [<ffffffff8105cf00>] ? flush_kthread_worker+0xb0/0xb0 > > I put few printks to the code and found that iscsit_add_cmd_to_response_queue > was called _after_ iscsit_close_connection. So the iscsi termination logic seems > to be broken in the way that iscsi connection is closed and kfree'd while there are > still pending iscsi_cmds somewhere in the target referencing the deallocated > connection. Below is a dmesg snippet where iscsi_conn with address > ffff880224924800 is kfree'd (time 432.338664) but after successful termination > of the target, two outstanding cmds use this connection address in > iscsit_add_cmd_to_response_queue (time 432.479821). > > Dec 22 21:35:16 goose2 kernel: [ 432.338662] Moving to TARG_CONN_STATE_FREE. > Dec 22 21:35:16 goose2 kernel: [ 432.338664] MARTIN: freeing connection ffff880224924800. > Dec 22 21:35:16 goose2 kernel: [ 432.338665] Decremented iSCSI connection count to 0 from node: iqn.1998-01.com.vmware:hv-esx8 > Dec 22 21:35:16 goose2 kernel: [ 432.338666] Moving to TARG_SESS_STATE_FAILED. > Dec 22 21:35:16 goose2 kernel: [ 432.338671] TARGET_CORE[iSCSI]: Deregistered fabric_sess > Dec 22 21:35:16 goose2 kernel: [ 432.338673] Moving to TARG_SESS_STATE_FREE. > Dec 22 21:35:16 goose2 kernel: [ 432.338674] Released iSCSI session from node: iqn.1998-01.com.vmware:hv-esx8 > Dec 22 21:35:16 goose2 kernel: [ 432.338675] Decremented number of active iSCSI Sessions on iSCSI TPG: 1 to 0 > Dec 22 21:35:16 goose2 kernel: [ 432.338677] Released 1 iSCSI Session(s) from Target Portal Group: 1 > Dec 22 21:35:16 goose2 kernel: [ 432.338678] iSCSI_TPG[1] - Disabled iSCSI Target Portal Group > Dec 22 21:35:16 goose2 kernel: [ 432.338717] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san2 TPGT: 1 PORTAL: 192.168.224.10:3260 > Dec 22 21:35:16 goose2 kernel: [ 432.338720] CORE[iqn.2011-07.com.zoner:inetstore.vmw0.san2] - Removed Network Portal: 192.168.224.10:3260,1 on TCP > Dec 22 21:35:16 goose2 kernel: [ 432.338788] CORE[0] - Removed Network Portal: 192.168.224.10:3260 on TCP > Dec 22 21:35:16 goose2 kernel: [ 432.338789] LIO_Target_ConfigFS: delnpfromtpg done! > Dec 22 21:35:16 goose2 kernel: [ 432.338838] iSCSI_TPG[1]_LUN[2] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 2 > Dec 22 21:35:16 goose2 kernel: [ 432.338857] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 2 > Dec 22 21:35:16 goose2 kernel: [ 432.338877] iSCSI_TPG[1]_LUN[1] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 1 > Dec 22 21:35:16 goose2 kernel: [ 432.338894] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 1 > Dec 22 21:35:16 goose2 kernel: [ 432.338933] iSCSI_TPG[1] - Deleted ACL with TCQ Depth: 64 for iSCSI Initiator Node: iqn.1998-01.com.vmware:hv-esx9 > Dec 22 21:35:16 goose2 kernel: [ 432.338965] iSCSI_TPG[1]_LUN[2] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 2 > Dec 22 21:35:16 goose2 kernel: [ 432.338983] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 2 > Dec 22 21:35:16 goose2 kernel: [ 432.339003] iSCSI_TPG[1]_LUN[1] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 1 > Dec 22 21:35:16 goose2 kernel: [ 432.339024] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 1 > Dec 22 21:35:16 goose2 kernel: [ 432.339063] iSCSI_TPG[1] - Deleted ACL with TCQ Depth: 64 for iSCSI Initiator Node: iqn.1998-01.com.vmware:hv-esx8 > Dec 22 21:35:16 goose2 kernel: [ 432.339184] iSCSI_TPG[1]_LUN[2] - Deactivated iSCSI Logical Unit from device object > Dec 22 21:35:16 goose2 kernel: [ 432.339313] iSCSI_TPG[1]_LUN[1] - Deactivated iSCSI Logical Unit from device object > Dec 22 21:35:16 goose2 kernel: [ 432.339353] LIO_Target_ConfigFS: DEREGISTER -> Releasing TPG > Dec 22 21:35:16 goose2 kernel: [ 432.339355] Released 0 iSCSI Session(s) from Target Portal Group: 1 > Dec 22 21:35:16 goose2 kernel: [ 432.339366] TARGET_CORE[Normal]: Deallocating iSCSI struct se_portal_group for endpoint: iqn.2011-07.com.zoner:inetstore.vmw0.san2 Portal Tag 1 > Dec 22 21:35:16 goose2 kernel: [ 432.339398] CORE[iqn.2011-07.com.zoner:inetstore.vmw0.san2]_TPG[1] - Deleted iSCSI Target Portal Group > ... > Dec 22 21:35:16 goose2 lrmd: [3117]: info: operation stop[153] on tpgvmw0 for client 3120: pid 8900 exited with return code 0 > ... > Dec 22 21:35:17 goose2 kernel: [ 432.479815] Incremented dev->dev_cur_ordered_id: 42204 for SIMPLE: 42204 > Dec 22 21:35:17 goose2 kernel: [ 432.479821] MARTIN: Adding ffff880210641a40 cmd to conn ffff880224924800 as qreq ffff88020fca4330 > Dec 22 21:35:17 goose2 kernel: [ 432.483560] Incremented dev->dev_cur_ordered_id: 42205 for SIMPLE: 42241 > Dec 22 21:35:17 goose2 kernel: [ 432.483564] MARTIN: Adding ffff88020e39eac0 cmd to conn ffff880224924800 as qreq ffff88020fca4230 > ... Ok, so this is signaling that the connection is being allowed to fully shutdown + release while se_cmd are still outstanding to TCM. Definitely a regression bug.. Digging into this a bit more, I'm starting to think the root problem is that the CMD_T_ACTIVE bit in se_cmd->transport_state is getting cleared by transport_cmd_check_stop() during transport_generic_new_cmd() for pending WRITEs before the cmd->se_tfo->write_pending() callback occurs, but is not getting reset until way too late in target_complete_cmd(). This bit used to be set each time in transport_add_cmd_to_queue() when a se_cmd was added to the legacy transport_processing_thread() logic (including for iSCSI WRITEs), but this has since been removed starting from v3.6 code. (Here is the v3.5.x reference for good measure) http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=blob;f=drivers/target/target_core_transport.c;h=0981707216b7f937db1b7a6931b9df113048ecef;hb=refs/heads/linux-3.5.y#l600 So here is a quick patch to explicitly re-set CMD_T_ACTIVE within target_execute_cmd(), which is the function that iscsi-target now calls in >= v3.6.x to submit individual WRITEs once the full data payload has been received. diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 978762d..1689c8b 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -1687,6 +1687,7 @@ void target_execute_cmd(struct se_cmd *cmd) return; } + cmd->transport_state |= CMD_T_ACTIVE; cmd->t_state = TRANSPORT_PROCESSING; spin_unlock_irq(&cmd->t_state_lock); Please give it a shot, and let me know if your able to reproduce. Thanks, --nab > Dec 22 21:35:17 goose2 kernel: [ 432.510296] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san2 > Dec 22 21:35:17 goose2 kernel: [ 432.510302] CORE[0] - Deleted iSCSI Target IQN: iqn.2011-07.com.zoner:inetstore.vmw0.san2 > Dec 22 21:35:17 goose2 kernel: [ 432.510338] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san1 > Dec 22 21:35:17 goose2 kernel: [ 432.510340] CORE[0] - Deleted iSCSI Target IQN: iqn.2011-07.com.zoner:inetstore.vmw0.san1 > Dec 22 21:35:17 goose2 kernel: [ 432.510485] Target_Core_ConfigFS: Calling se_free_virtual_device() for se_dev_ptr: ffff880211081c00 > Dec 22 21:35:17 goose2 kernel: [ 432.510527] Target_Core_ConfigFS: Deallocating se_subsystem_dev_t: ffff8802102d1000 > Dec 22 21:35:17 goose2 kernel: [ 432.510604] Target_Core_ConfigFS: Calling se_free_virtual_device() for se_dev_ptr: ffff88020fc71000 > Dec 22 21:35:17 goose2 kernel: [ 432.510666] Target_Core_ConfigFS: Deallocating se_subsystem_dev_t: ffff8802103c3000 > Dec 22 21:35:17 goose2 kernel: [ 432.510690] CORE_HBA[12] - Detached HBA from Generic Target Core > Dec 22 21:35:17 goose2 kernel: [ 432.510708] CORE_HBA[11] - Detached HBA from Generic Target Core > Dec 22 21:35:17 goose2 kernel: [ 432.510729] CORE_HBA[10] - Detached HBA from Generic Target Core > ... > Dec 22 21:35:44 goose2 kernel: [ 460.265788] BUG: soft lockup - CPU#4 stuck for 22s! [prmount:8994] > > Tested with 3.7.1 stable kernel, 3.6.x looks to have the same problem. > I skipped v3.4 and v3.5 so I don't know if the bug was introduced in v3.6 > or earlier. > > Martin > > -- > To unsubscribe from this list: send the line "unsubscribe target-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html