On Fri, 2016-02-12 at 05:30 +0000, Himanshu Madhani wrote:
> Hi Nic,
>
> On 2/11/16, 3:47 PM, "Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> wrote:
>
> >On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
> >> On Tue, 2016-02-09 at 18:03 +0000, Himanshu Madhani wrote:
> >> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> wrote:
> >> > >On Mon, 2016-02-08 at 23:27 +0000, Himanshu Madhani wrote:
> >> > >>
> >> > >> I am testing this series with the 4.5.0-rc2+ kernel and I am
> >> > >> seeing an issue where triggering sg_reset with the
> >> > >> host/device/bus options in a loop at 120-second intervals
> >> > >> causes a call stack.  At this point, removing the configuration
> >> > >> hangs indefinitely.  See the attached dmesg output from my setup.
> >> > >>
> >> > >
> >> > >Thanks a lot for testing this.
> >> > >
> >> > >So it looks like we're still hitting an indefinite schedule() on
> >> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
> >> > >occurs, after repeated explicit active I/O remote-port sg_resets.
> >> > >
> >> > >Does this trigger on the first tcm_qla2xxx session reconnect after
> >> > >an explicit remote-port sg_reset..?  Are session reconnects actively
> >> > >being triggered during the test..?
> >> > >
> >> > >To verify the latter for iscsi-target, I've been using a small patch
> >> > >to trigger session reset from TMR kthread context in order to
> >> > >simulate the I_T disconnects.  Something like that would be useful
> >> > >for verifying with tcm_qla2xxx too.
> >> > >
> >> > >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
> >> > >will enable various debug in a WIP branch for testing.
> >>
> >> Following up here..
> >>
> >> So far, using my test setup with ISP2532 ports in P2P + RAMDISK_MCP
> >> and v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d)
> >> has been functioning as expected with a blocksize_range=4k-256k +
> >> iodepth=32 fio write-verify style workload.
> >>
> >> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs
> >> from outstanding target TAS responses, nor fio write-verify failures
> >> to report after 800x remote-port active I/O LUN_RESETs.
> >>
> >> Next step will be to verify explicit tcm_qla2xxx port + module
> >> shutdown after 1K test iterations, and then IBLOCK async completions
> >> <-> NVMe backends with the same case.
> >>
> >
> >After letting this test run overnight up to 7k active I/O remote-port
> >LUN_RESETs, things are still functioning as expected.
> >
> >Also, /etc/init.d/target stop was able to successfully shut down all
> >active sessions and unload tcm_qla2xxx after the test run.
> >
> >So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
> >tcm_qla2xxx ports, separate from the concurrent session disconnect hung
> >task you reported earlier.
> >
> >That said, I'll likely push this series as-is for -rc4, given that Dan
> >has also been able to verify the non-concurrent session disconnect case
> >on his setup generating constant ABORT_TASKs, and it's still surviving
> >both cases for iscsi-target ports.
> >
> >Please give the debug patch from last night a shot, and see if we can
> >determine the se_cmd states when you hit the hung task.
>
> I'll give your latest debug patch a try in a little while.
>
> From the testing that I have done, what is seen is that active IO has
> already been completed and the qla2xxx driver is waiting for commands
> to be completed; it's waiting indefinitely on cmd_wait_comp.  So it
> looks like there is a missing complete call from target_core.  I've
> attached our analysis from crash debug on a live system after the
> issue happens.
>

Thanks for the crash dump output.  So it's a t_state =
TRANSPORT_WRITE_PENDING descriptor with SAM_STAT_CHECK_CONDITION +
cmd_kref.refcount = 0:

struct qla_tgt_cmd {
    se_cmd = {
        scsi_status = 0x2
        se_cmd_flags = 0x80090d,

        <SNIP>

        cmd_kref = {
            refcount = {
                counter = 0x0
            }
        },
    }

The se_cmd_flags = 0x80090d translation to enum se_cmd_flags_table:

  - SCF_TRANSPORT_TASK_SENSE
  - SCF_EMULATED_TASK_SENSE
  - SCF_SCSI_DATA_CDB
  - SCF_SE_LUN_CMD
  - SCF_SENT_CHECK_CONDITION
  - SCF_USE_CPUID

Also, ->cmd_wait_comp has a zero rlock counter + dead magic:

cmd_wait_comp = {
    done = 0x0,
    wait = {
        lock = {
            {
                rlock = {
                    raw_lock = {
                        val = {
                            counter = 0x0
                        }
                    },
                    magic = 0xdead4ead,
                    owner_cpu = 0xffffffff,
                    owner = 0xffffffffffffffff,

> I can recreate this issue at will within 5 minutes of triggering
> sg_reset with the following steps:
>
> 1. Export 4 RAM disk LUNs on each port of a 2-port adapter.  The
>    initiator will see 8 RAM disk targets.
> 2. Start IO with 4K block size and 8 threads, with 80% write / 20%
>    read and 100% random.
>    (I am using vdbench for generating IO.  I can provide the
>    setup/config script if needed.)
> 3. Start sg_reset for each LUN, first with device, then bus, then
>    host, with a 120s delay.  (I've attached my script that I am using
>    for triggering sg_reset.)
>

Thanks, will keep looking and try to reproduce with your script.
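
[Editor's note: for readers following the cmd_wait_comp analysis above,
below is a minimal sketch of the wait/complete pairing being discussed,
paraphrased from memory of the v4.5-era drivers/target/target_core_transport.c.
It is a simplified illustration only: locking, TAS handling, and the
wait-list splicing are trimmed, and the example_* names are placeholders
standing in for target_wait_for_sess_cmds() and target_release_cmd_kref().
Field names such as cmd_wait_set and sess_cmd_list are from memory of that
era's headers; verify against the exact kernel source.]

#include <linux/kref.h>
#include <linux/list.h>
#include <linux/completion.h>
#include <target/target_core_base.h>
#include <target/target_core_fabric.h>

/*
 * Session shutdown side (stands in for target_wait_for_sess_cmds()):
 * walk the per-session command list and block on each descriptor's
 * cmd_wait_comp until the kref release path signals it.
 */
static void example_wait_for_sess_cmds(struct se_session *se_sess)
{
    struct se_cmd *se_cmd, *tmp;

    list_for_each_entry_safe(se_cmd, tmp, &se_sess->sess_cmd_list,
                             se_cmd_list) {
        /*
         * In the crash dump above, cmd_kref.counter is already 0 but
         * cmd_wait_comp.done is still 0, so this wait never returns
         * and the shutdown thread hangs.
         */
        wait_for_completion(&se_cmd->cmd_wait_comp);
        se_cmd->se_tfo->release_cmd(se_cmd);
    }
}

/*
 * kref release side (stands in for target_release_cmd_kref()): the
 * only place cmd_wait_comp is completed.  If the final kref put goes
 * down a path that skips this complete(), any waiter blocks forever.
 */
static void example_release_cmd_kref(struct kref *kref)
{
    struct se_cmd *se_cmd = container_of(kref, struct se_cmd, cmd_kref);

    if (se_cmd->cmd_wait_set) {
        /* A session shutdown is waiting on this descriptor. */
        complete(&se_cmd->cmd_wait_comp);
        return;
    }
    se_cmd->se_tfo->release_cmd(se_cmd);
}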