On Fri, 2011-09-16 at 21:39 +0200, Martin Svec wrote: > Hello, > > I tried my 2.6.39 LIO stress tests with the latest mainline 3.1-rc6+ code and > with one of the tests, I sometimes hit an endless wait in transport_clear_lun_from_sessions() > coupled with a wait in transport_clear_lun_thread(). > > The affected test repeatedly adds+enables and disables+removes a TPG while an initiator > is connected to it and performs heavy sequential writes using fio. The problem occurs during > TPG removal in unlink("/sys/kernel/config/target/iscsi/<target-iqn>/<tpgt#>/lun/<lun#>/<dev-symlink>"). > When entering the unlink(), TPG is already disabled by "echo 0 > .../enable" and all initiator > ACLs are removed. It does not happen every time but I can almost always reproduce it within > 10-20 minutes. > > Tested on bleeding edge mainline 3.1-rc6+ with today's target patches for -rc7. > Hi Martin, Thanks for the detailed bug-report. It sounds like you are pretty certain this is regression with the v3.1 target-core+iscsi-target mainline code, yes..? Have you been able to verify this same test with the pre v3.1 (eg: v4.0.0-rc7) code in lio-core-2.6.git..? Nothing immediately rings a bell here, but i'll have a deeper look and try reproducing over the weekend.. Also, it would be very helpful if you could enable the dynamic_printk right before the shutdown sequence with: echo 'module iscsi_target_mod +p' > /debug/dynamic_debug/control and disable it before re-enabling the TPG to avoid all of the WRITE I/O output noise with: echo 'module iscsi_target_mod -p' > /debug/dynamic_debug/control This will require CONFIG_DYNAMIC_DEBUG=y in your kernel config, but would be very helpful to diagnose the issue. --nab > Call trace: > > Sep 16 17:45:18 goose1 kernel: [ 8877.167113] INFO: task liomtpg:27686 blocked for more than 120 seconds. > Sep 16 17:45:18 goose1 kernel: [ 8877.167117] "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Sep 16 17:45:18 goose1 kernel: [ 8877.167120] liomtpg D ffff880321d68300 0 27686 2841 0x00000004 > Sep 16 17:45:18 goose1 kernel: [ 8877.167126] ffff8803056fdc58 0000000000000082 ffff8803056fdb0a 0000001000000000 > Sep 16 17:45:18 goose1 kernel: [ 8877.167133] ffff880321d68090 00000000000116c0 ffff8803056fdfd8 ffff8803056fc010 > Sep 16 17:45:18 goose1 kernel: [ 8877.167137] ffff8803056fdfd8 00000000000116c0 ffff88032275f2d0 ffff880321d68090 > Sep 16 17:45:18 goose1 kernel: [ 8877.167143] Call Trace: > Sep 16 17:45:18 goose1 kernel: [ 8877.167157] [<ffffffff810394ca>] schedule+0x3a/0x60 > Sep 16 17:45:18 goose1 kernel: [ 8877.167165] [<ffffffff81543475>] schedule_timeout+0x1b5/0x290 > Sep 16 17:45:18 goose1 kernel: [ 8877.167170] [<ffffffff8103bd5c>] ? set_cpus_allowed_ptr+0x7c/0x110 > Sep 16 17:45:18 goose1 kernel: [ 8877.167176] [<ffffffff815427b2>] wait_for_common+0xc2/0x160 > Sep 16 17:45:18 goose1 kernel: [ 8877.167180] [<ffffffff8103bb40>] ? try_to_wake_up+0x290/0x290 > Sep 16 17:45:18 goose1 kernel: [ 8877.167183] [<ffffffff815428f8>] wait_for_completion+0x18/0x20 > Sep 16 17:45:18 goose1 kernel: [ 8877.167206] [<ffffffffa00148d1>] transport_clear_lun_from_sessions+0x51/0x80 [target_core_mod] > Sep 16 17:45:18 goose1 kernel: [ 8877.167219] [<ffffffffa00123ea>] core_tpg_post_dellun+0x2a/0x60 [target_core_mod] > Sep 16 17:45:18 goose1 kernel: [ 8877.167229] [<ffffffffa000602c>] core_dev_del_lun+0x2c/0x70 [target_core_mod] > Sep 16 17:45:18 goose1 kernel: [ 8877.167235] [<ffffffffa0008a7e>] target_fabric_port_unlink+0x4e/0x60 [target_core_mod] > Sep 16 17:45:18 goose1 kernel: [ 8877.167241] [<ffffffff8115925a>] configfs_unlink+0xfa/0x1c0 > Sep 16 17:45:18 goose1 kernel: [ 8877.167245] [<ffffffff810fc75e>] vfs_unlink+0x8e/0x100 > Sep 16 17:45:18 goose1 kernel: [ 8877.167248] [<ffffffff810ffca3>] do_unlinkat+0x1a3/0x1e0 > Sep 16 17:45:18 goose1 kernel: [ 8877.167251] [<ffffffff810f18ad>] ? vfs_write+0x12d/0x180 > Sep 16 17:45:18 goose1 kernel: [ 8877.167253] [<ffffffff810f19ec>] ? sys_write+0x4c/0x90 > Sep 16 17:45:18 goose1 kernel: [ 8877.167255] [<ffffffff810ffcf1>] sys_unlink+0x11/0x20 > Sep 16 17:45:18 goose1 kernel: [ 8877.167260] [<ffffffff815454fb>] system_call_fastpath+0x16/0x1b > Sep 16 17:45:18 goose1 kernel: [ 8877.167262] INFO: task tcm_cl_9:27714 blocked for more than 120 seconds. > Sep 16 17:45:18 goose1 kernel: [ 8877.167263] "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Sep 16 17:45:18 goose1 kernel: [ 8877.167265] tcm_cl_9 D ffff8803224a62c0 0 27714 2 0x00000000 > Sep 16 17:45:18 goose1 kernel: [ 8877.167267] ffff880314c7dd60 0000000000000046 0000000000000000 0000000000000000 > Sep 16 17:45:18 goose1 kernel: [ 8877.167270] ffff8803224a6050 00000000000116c0 ffff880314c7dfd8 ffff880314c7c010 > Sep 16 17:45:18 goose1 kernel: [ 8877.167273] ffff880314c7dfd8 00000000000116c0 ffff8803227a9350 ffff8803224a6050 > Sep 16 17:45:18 goose1 kernel: [ 8877.167276] Call Trace: > Sep 16 17:45:18 goose1 kernel: [ 8877.167278] [<ffffffff810394ca>] schedule+0x3a/0x60 > Sep 16 17:45:18 goose1 kernel: [ 8877.167280] [<ffffffff81543475>] schedule_timeout+0x1b5/0x290 > Sep 16 17:45:18 goose1 kernel: [ 8877.167282] [<ffffffff8103ba90>] ? try_to_wake_up+0x1e0/0x290 > Sep 16 17:45:18 goose1 kernel: [ 8877.167284] [<ffffffff815427b2>] wait_for_common+0xc2/0x160 > Sep 16 17:45:18 goose1 kernel: [ 8877.167286] [<ffffffff8103bb40>] ? try_to_wake_up+0x290/0x290 > Sep 16 17:45:18 goose1 kernel: [ 8877.167289] [<ffffffff815428f8>] wait_for_completion+0x18/0x20 > Sep 16 17:45:18 goose1 kernel: [ 8877.167295] [<ffffffffa0016665>] transport_clear_lun_thread+0x295/0x340 [target_core_mod] > Sep 16 17:45:18 goose1 kernel: [ 8877.167302] [<ffffffffa00163d0>] ? transport_stop_tasks_for_cmd+0x170/0x170 [target_core_mod] > Sep 16 17:45:18 goose1 kernel: [ 8877.167308] [<ffffffff8105e476>] kthread+0x96/0xa0 > Sep 16 17:45:18 goose1 kernel: [ 8877.167313] [<ffffffff815472f4>] kernel_thread_helper+0x4/0x10 > Sep 16 17:45:18 goose1 kernel: [ 8877.167315] [<ffffffff8105e3e0>] ? kthread_worker_fn+0x130/0x130 > Sep 16 17:45:18 goose1 kernel: [ 8877.167317] [<ffffffff815472f0>] ? gs_change+0xb/0xb > > > The process that calls unlink() blocks on lun->lun_shutdown_comp: > > (gdb) list *(transport_clear_lun_from_sessions+0x51) > 0x148d1 is in transport_clear_lun_from_sessions (drivers/target/target_core_transport.c:4608). > 4603 "tcm_cl_%u", lun->unpacked_lun); > 4604 if (IS_ERR(kt)) { > 4605 pr_err("Unable to start clear_lun thread\n"); > 4606 return PTR_ERR(kt); > 4607 } > 4608 wait_for_completion(&lun->lun_shutdown_comp); > 4609 > 4610 return 0; > 4611 } > 4612 > > > tcm_cl thread blocks on cmd->transport_lun_stop_comp: > > (gdb) list *(transport_clear_lun_thread+0x295) > 0x16665 is in transport_clear_lun_thread (drivers/target/target_core_transport.c:4479). > 4474 " %d\n", cmd, cmd->t_task_list_num, ret); > 4475 if (!ret) { > 4476 pr_debug("ConfigFS: ITT[0x%08x] - stopping cmd....\n", > 4477 cmd->se_tfo->get_task_tag(cmd)); > 4478 wait_for_completion(&cmd->transport_lun_stop_comp); > 4479 pr_debug("ConfigFS: ITT[0x%08x] - stopped cmd....\n", > 4480 cmd->se_tfo->get_task_tag(cmd)); > 4481 } > 4482 transport_remove_cmd_from_queue(cmd,&cmd->se_dev->dev_queue_obj); > 4483 > > > Though I use my own library to manage configfs objects, the sequence of changes in configfs should > be the same as in lio-utils. Below is my log of configfs operations that forego the bug (last line > corresponds to the frozen unlink): > > Sep 16 18:54:48 goose1 LIO: 8969 Disabling TPG iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting TPG network portal /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/np/192.168.224.10:3260 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator LUN ACL mapping /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.2005-09.com.zoner.goosem:openiscsi-initiator/lun_9/lio_lun > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator LUN ACL /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.2005-09.com.zoner.goosem:openiscsi-initiator/lun_9 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator ACL dir /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.2005-09.com.zoner.goosem:openiscsi-initiator > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator LUN ACL mapping /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.1998-01.com.vmware:hv-esx9/lun_9/lio_lun > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator LUN ACL /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.1998-01.com.vmware:hv-esx9/lun_9 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator ACL dir /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.1998-01.com.vmware:hv-esx9 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator LUN ACL mapping /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.1998-01.com.vmware:hv-esx8/lun_9/lio_lun > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator LUN ACL /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.1998-01.com.vmware:hv-esx8/lun_9 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting Initiator ACL dir /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/acls/iqn.1998-01.com.vmware:hv-esx8 > Sep 16 18:54:48 goose1 LIO: 8969 Deleting TPG LUN link /sys/kernel/config/target/iscsi/iqn.2011-07.com.zoner:inetstore.vmw0.san2/tpgt_1/lun/lun_9/cpg0_lun9-lun9 > > > Martin > > > -- > To unsubscribe from this list: send the line "unsubscribe target-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html