On Sun, 2011-09-18 at 11:50 +0200, Martin Svec wrote: > Hi Nick, > > Below is the requested dynamic_debug output. I highlighted > the lines you wanted to convert to printk by "[TCM]" prefix. > > Few notes: > > I'm not sure if it is a 3.1 regression because I've made a number > of changes to my SAN setup since 2.6.39 tests. Now, I have two > targets (san1 and san2) that provide multipathed access to the > same LUN under two different IQNs and two different NPs. > I put a mark in the output to show where the successful san1 > TPG cleanup ends and failed san2 TPG removal starts. > > Also, I simulate loss of cabling by setting a firewall rule "DROP all > INPUT on port 3260" immediately before the TPG removal sequence. > Don't know if it can have something to do with the bug but I mention > it for completeness. > > Now, I want to repeat the test with only one TPG and also get kdb > backtrace of rx/tx threads. Side question -- is there a way to play > with kdb remotely? I don't have physical access to the servers > today. > This is still not making sense to me for one important reason: See below.. > Martin > > <SNIP> > ---- end of san1 cleanup --- > > Sep 17 22:34:31 goose2 kernel: [11252.240833] Moving to TARG_CONN_STATE_CLEANUP_WAIT. > Sep 17 22:34:31 goose2 kernel: [11252.240836] Performing cleanup for failed iSCSI Connection ID: 0 from iqn.1998-01.com.vmware:hv-esx8 > Sep 17 22:34:31 goose2 kernel: [11252.240840] Closing iSCSI connection CID 0 on SID: 1 > Sep 17 22:34:31 goose2 kernel: [11252.240847] rx_loop: -512 total_rx: 0 > Sep 17 22:34:31 goose2 kernel: [11252.240895] Moving to TARG_CONN_STATE_FREE. > Sep 17 22:34:31 goose2 kernel: [11252.240898] Decremented iSCSI connection count to 0 from node: iqn.1998-01.com.vmware:hv-esx8 > Sep 17 22:34:31 goose2 kernel: [11252.240900] Moving to TARG_SESS_STATE_FAILED. > Sep 17 22:34:31 goose2 kernel: [11252.240934] Moving to TARG_SESS_STATE_FREE. > Sep 17 22:34:31 goose2 kernel: [11252.240937] Released iSCSI session from node: iqn.1998-01.com.vmware:hv-esx8 > Sep 17 22:34:31 goose2 kernel: [11252.240939] Decremented number of active iSCSI Sessions on iSCSI TPG: 1 to 1 > Sep 17 22:34:31 goose2 kernel: [11252.265021] Moving to TARG_CONN_STATE_CLEANUP_WAIT. > Sep 17 22:34:31 goose2 kernel: [11252.265028] Performing cleanup for failed iSCSI Connection ID: 0 from iqn.1998-01.com.vmware:hv-esx9 > Sep 17 22:34:31 goose2 kernel: [11252.265032] Closing iSCSI connection CID 0 on SID: 2 > Sep 17 22:34:31 goose2 kernel: [11252.265044] rx_loop: -512 total_rx: 0 > Sep 17 22:34:31 goose2 kernel: [11252.265079] Updated MaxCmdSN to 0x00e75fc3 > Sep 17 22:34:31 goose2 kernel: [11252.265083] [TCM] TGWFT: wait_for_tasks: Stopping ffff8800b8af8d00 ITT: 0x1f5de700 i_state: 1, t_state/def_t_state: 5/0, t_transport_stop = TRUE > Sep 17 22:34:32 goose2 kernel: [11252.501136] [TCM] TGWFT: wait_for_tasks: Stopped wait_for_compltion(&cmd->t_transport_stop_comp) for ITT: 0x1f5de700 > Sep 17 22:34:32 goose2 kernel: [11252.501149] Updated MaxCmdSN to 0x00e75fc4 > Sep 17 22:34:32 goose2 kernel: [11252.501152] [TCM] TGWFT: wait_for_tasks: Stopping ffff8800b7dfed00 ITT: 0x4c5fe700 i_state: 6, t_state/def_t_state: 5/0, t_transport_stop = TRUE > Sep 17 22:34:32 goose2 kernel: [11252.506286] [TCM] TGWFT: wait_for_tasks: Stopped wait_for_compltion(&cmd->t_transport_stop_comp) for ITT: 0x4c5fe700 > Sep 17 22:34:32 goose2 kernel: [11252.506766] Moving to TARG_CONN_STATE_FREE. > Sep 17 22:34:32 goose2 kernel: [11252.506770] Decremented iSCSI connection count to 0 from node: iqn.1998-01.com.vmware:hv-esx9 > Sep 17 22:34:32 goose2 kernel: [11252.506773] Moving to TARG_SESS_STATE_FAILED. > Sep 17 22:34:32 goose2 kernel: [11252.506784] Moving to TARG_SESS_STATE_FREE. > Sep 17 22:34:32 goose2 kernel: [11252.506788] Released iSCSI session from node: iqn.1998-01.com.vmware:hv-esx9 > Sep 17 22:34:32 goose2 kernel: [11252.506791] Decremented number of active iSCSI Sessions on iSCSI TPG: 1 to 0 > Sep 17 22:34:32 goose2 kernel: [11252.506796] Released 2 iSCSI Session(s) from Target Portal Group: 1 > Sep 17 22:34:32 goose2 kernel: [11252.506799] iSCSI_TPG[1] - Disabled iSCSI Target Portal Group At this point, the two active sessions from vmware:hv-esx8 and vmware:hv-esx9 have been successfully released w/ all active I/O properly shutdown and accounted for.. Eg: everything is functioning as expected, and the number of active sessions has been decremented to '0'. > Sep 17 22:34:32 goose2 kernel: [11252.506878] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san2 TPGT: 1 PORTAL: 192.168.224.10:3260 > Sep 17 22:34:32 goose2 kernel: [11252.506883] CORE[iqn.2011-07.com.zoner:inetstore.vmw0.san2] - Removed Network Portal: 192.168.224.10:3260,1 on TCP > Sep 17 22:34:32 goose2 kernel: [11252.506921] CORE[0] - Removed Network Portal: 192.168.224.10:3260 on TCP > Sep 17 22:34:32 goose2 kernel: [11252.506923] LIO_Target_ConfigFS: delnpfromtpg done! > Sep 17 22:34:32 goose2 kernel: [11252.507388] [TCM] TCLFS: SE_LUN[9] - Setting cmd->transport_lun_stop for ITT: 0xc95fe700 > Sep 17 22:34:32 goose2 kernel: [11252.507392] [TCM] TCLFS: SE_LUN[9] - ITT: 0xc95fe700 before transport_lun_wait_for_tasks() > Sep 17 22:34:32 goose2 kernel: [11252.507395] [TCM] TSTFC: ITT[0xc95fe700] - Stopping tasks > Sep 17 22:34:32 goose2 kernel: [11252.507398] [TCM] TLWFT: ConfigFS: cmd: ffff8800b6eef7c0 t_tasks: 0 stop tasks ret: 0 > Sep 17 22:34:32 goose2 kernel: [11252.507401] [TCM] TLWFT: ConfigFS: ITT[0xc95fe700] - stopping cmd.... This is the really strange part.. So here we are removing the TPG LUN after all of the sessions have been shutdown, but there is still somehow a se_cmd descriptor in se_lun->lun_cmd_list (the offending se_cmd descriptor is ITT: 0xc95fe700, cmd ffff8800b6eef7c0) So looking at target_core_mod, I still don't see how releasing a se_cmd via the normal paths could ever avoid also removing itself from se_lun->lun_cmd_list. So at least for now pretty sure it's not in target_core_mod.. Looking at the incoming iscsi_target.c:iscsit_handle_scsi_cmd() path, I don't see any failures that could cause this scenario either.. Can you try reproducing again with the following to add a bit more context for the offending se_cmd descriptor in transport_lun_wait_for_tasks()..? Thanks, --nab diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 315790e..4f31e3d 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -4477,8 +4477,11 @@ static int transport_lun_wait_for_tasks(struct se_cmd *cmd, struct se_lun *lun) pr_debug("ConfigFS: cmd: %p t_tasks: %d stop tasks ret:" " %d\n", cmd, cmd->t_task_list_num, ret); if (!ret) { - pr_debug("ConfigFS: ITT[0x%08x] - stopping cmd....\n", - cmd->se_tfo->get_task_tag(cmd)); + pr_debug("ConfigFS: ITT[0x%08x] - stopping cmd.... fabric state: %d," + " t_state: %d, cmd->deferred_t_state: %d\n", + cmd->se_tfo->get_task_tag(cmd), + cmd->se_tfo->get_cmd_state(cmd), cmd->t_state, + cmd->deferred_t_state); wait_for_completion(&cmd->transport_lun_stop_comp); pr_debug("ConfigFS: ITT[0x%08x] - stopped cmd....\n", cmd->se_tfo->get_task_tag(cmd)); > Sep 17 22:37:00 goose2 kernel: [11400.770031] INFO: task liomtpg:17065 blocked for more than 120 seconds. > Sep 17 22:37:00 goose2 kernel: [11400.770034] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Sep 17 22:37:00 goose2 kernel: [11400.770037] liomtpg D ffff88012ac2e500 0 17065 2651 0x00000004 > Sep 17 22:37:00 goose2 kernel: [11400.770042] ffff8800b723bc48 0000000000000086 ffff8800b723bb0a 0000001000000000 > Sep 17 22:37:00 goose2 kernel: [11400.770047] ffff88012ac2e290 00000000000116c0 ffff8800b723bfd8 ffff8800b723a010 > Sep 17 22:37:00 goose2 kernel: [11400.770052] ffff8800b723bfd8 00000000000116c0 ffff88012b385290 ffff88012ac2e290 > Sep 17 22:37:00 goose2 kernel: [11400.770057] Call Trace: > Sep 17 22:37:00 goose2 kernel: [11400.770066] [<ffffffff8103968a>] schedule+0x3a/0x60 > Sep 17 22:37:00 goose2 kernel: [11400.770071] [<ffffffff815587e5>] schedule_timeout+0x1b5/0x290 > Sep 17 22:37:00 goose2 kernel: [11400.770075] [<ffffffff8103bf1c>] ? set_cpus_allowed_ptr+0x7c/0x110 > Sep 17 22:37:00 goose2 kernel: [11400.770081] [<ffffffff81557b22>] wait_for_common+0xc2/0x160 > Sep 17 22:37:00 goose2 kernel: [11400.770084] [<ffffffff8103bd00>] ? try_to_wake_up+0x290/0x290 > Sep 17 22:37:00 goose2 kernel: [11400.770087] [<ffffffff81557c68>] wait_for_completion+0x18/0x20 > Sep 17 22:37:00 goose2 kernel: [11400.770107] [<ffffffffa003d1d1>] transport_clear_lun_from_sessions+0x51/0x80 [target_core_mod] > Sep 17 22:37:00 goose2 kernel: [11400.770118] [<ffffffffa003a93a>] core_tpg_post_dellun+0x2a/0x60 [target_core_mod] > Sep 17 22:37:00 goose2 kernel: [11400.770128] [<ffffffffa002ca35>] core_dev_del_lun+0x35/0xc0 [target_core_mod] > Sep 17 22:37:00 goose2 kernel: [11400.770132] [<ffffffff811133fe>] ? simple_unlink+0x4e/0x60 > Sep 17 22:37:00 goose2 kernel: [11400.770142] [<ffffffffa002fa3e>] target_fabric_port_unlink+0x4e/0x60 [target_core_mod] > Sep 17 22:37:00 goose2 kernel: [11400.770147] [<ffffffff8115a24a>] configfs_unlink+0xfa/0x1c0 > Sep 17 22:37:00 goose2 kernel: [11400.770151] [<ffffffff810fcbee>] vfs_unlink+0x8e/0x100 > Sep 17 22:37:00 goose2 kernel: [11400.770154] [<ffffffff81100133>] do_unlinkat+0x1a3/0x1e0 > Sep 17 22:37:00 goose2 kernel: [11400.770159] [<ffffffff810f1d3d>] ? vfs_write+0x12d/0x180 > Sep 17 22:37:00 goose2 kernel: [11400.770162] [<ffffffff810f1e7c>] ? sys_write+0x4c/0x90 > Sep 17 22:37:00 goose2 kernel: [11400.770165] [<ffffffff81100181>] sys_unlink+0x11/0x20 > Sep 17 22:37:00 goose2 kernel: [11400.770169] [<ffffffff8155a83b>] system_call_fastpath+0x16/0x1b > Sep 17 22:37:00 goose2 kernel: [11400.770172] INFO: task tcm_cl_9:17086 blocked for more than 120 seconds. > Sep 17 22:37:00 goose2 kernel: [11400.770175] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Sep 17 22:37:00 goose2 kernel: [11400.770177] tcm_cl_9 D ffff88012ac92380 0 17086 2 0x00000000 > Sep 17 22:37:00 goose2 kernel: [11400.770181] ffff880106bffd60 0000000000000046 0000000000000002 06bffdc000000000 > Sep 17 22:37:00 goose2 kernel: [11400.770186] ffff88012ac92110 00000000000116c0 ffff880106bfffd8 ffff880106bfe010 > Sep 17 22:37:00 goose2 kernel: [11400.770190] ffff880106bfffd8 00000000000116c0 ffff88012b33d210 ffff88012ac92110 > Sep 17 22:37:00 goose2 kernel: [11400.770195] Call Trace: > Sep 17 22:37:00 goose2 kernel: [11400.770198] [<ffffffff8103968a>] schedule+0x3a/0x60 > Sep 17 22:37:00 goose2 kernel: [11400.770201] [<ffffffff815587e5>] schedule_timeout+0x1b5/0x290 > Sep 17 22:37:00 goose2 kernel: [11400.770205] [<ffffffff8103f470>] ? console_trylock+0x10/0x50 > Sep 17 22:37:00 goose2 kernel: [11400.770208] [<ffffffff81557b22>] wait_for_common+0xc2/0x160 > Sep 17 22:37:00 goose2 kernel: [11400.770212] [<ffffffff8103bd00>] ? try_to_wake_up+0x290/0x290 > Sep 17 22:37:00 goose2 kernel: [11400.770215] [<ffffffff81557c68>] wait_for_completion+0x18/0x20 > Sep 17 22:37:00 goose2 kernel: [11400.770226] [<ffffffffa003f3a6>] transport_clear_lun_thread+0x386/0x410 [target_core_mod] > Sep 17 22:37:00 goose2 kernel: [11400.770237] [<ffffffffa003f020>] ? transport_stop_tasks_for_cmd+0x1f0/0x1f0 [target_core_mod] > Sep 17 22:37:00 goose2 kernel: [11400.770242] [<ffffffff8105e636>] kthread+0x96/0xa0 > Sep 17 22:37:00 goose2 kernel: [11400.770246] [<ffffffff8155c634>] kernel_thread_helper+0x4/0x10 > Sep 17 22:37:00 goose2 kernel: [11400.770250] [<ffffffff8105e5a0>] ? kthread_worker_fn+0x130/0x130 > Sep 17 22:37:00 goose2 kernel: [11400.770253] [<ffffffff8155c630>] ? gs_change+0xb/0xb > > -- > To unsubscribe from this list: send the line "unsubscribe target-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html