Hi Himanshu & Co,

On Fri, 2016-02-12 at 00:48 -0800, Nicholas A. Bellinger wrote:
> On Fri, 2016-02-12 at 05:30 +0000, Himanshu Madhani wrote:

<SNIP>

> Thanks for the crash dump output.
>
> So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
> SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:
>
> struct qla_tgt_cmd {
>   se_cmd = {
>     scsi_status = 0x2
>     se_cmd_flags = 0x80090d,
>
>     <SNIP>
>
>     cmd_kref = {
>       refcount = {
>         counter = 0x0
>       }
>     },
> }
>
> The se_cmd_flags = 0x80090d translation to enum se_cmd_flags_table:
>
> - SCF_TRANSPORT_TASK_SENSE
> - SCF_EMULATED_TASK_SENSE
> - SCF_SCSI_DATA_CDB
> - SCF_SE_LUN_CMD
> - SCF_SENT_CHECK_CONDITION
> - SCF_USE_CPUID
>

After grokking your dump some more:

For SAM_STAT_CHECK_CONDITION with t_state = TRANSPORT_WRITE_PENDING, the
se_cmd->transport_state = 0x880 bits set are:

- CMD_T_DEV_ACTIVE
- CMD_T_FABRIC_STOP

and the sense buffer = 0x70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00,
which is the following entry from sense_info_table[]:

  [TCM_CHECK_CONDITION_ABORT_CMD] = {
          .key = ABORTED_COMMAND,
          .asc = 0x29, /* BUS DEVICE RESET FUNCTION OCCURRED */
          .ascq = 0x03,
  },

The descriptor looks like it did make it to tcm_qla2xxx_complete_free() ->
transport_generic_free_cmd() with both qla_tgt_cmd->cmd_sent_to_fw=0 and
qla_tgt_cmd->write_data_transferred=0.

Best I can tell, it looks like tcm_qla2xxx_handle_data_work() ->
transport_generic_request_failure() w/ TCM_CHECK_CONDITION_ABORT_CMD is
occurring.

So to confirm, this specific bug was not a result of an active I/O
LUN_RESET w/ CMD_T_ABORTED during session disconnect, or otherwise.

> > I can recreate this issue at will within 5 minutes of triggering
> > sg_reset with the following steps:
> >
> > 1. Export 4 RAM disk LUNs on each port of the 2-port adapter. The
> >    initiator will see 8 RAM disk targets.
> > 2. Start IO with 4K block size and 8 threads, with 80% write, 20% read
> >    and 100% random.
> >    (I am using vdbench for generating IO. I can provide the setup/config
> >    script if needed.)
> > 3. Start sg_reset for each LUN with first device, bus and host, with a
> >    120s delay. (I've attached the script that I am using for triggering
> >    sg_reset.)
> >

Thanks, will keep looking and try to reproduce with your script.

So here's my test setup with 3x Intel P3600 NVMe/IBLOCK backends, across
dual ISP2532 ports:

o- / ......................................................................................... [...]
  o- backstores .............................................................................. [...]
  | o- fileio ................................................................... [0 Storage Object]
  | o- iblock .................................................................. [3 Storage Objects]
  | | o- nvme0n1 ............................................................ [/dev/nvme0n1, in use]
  | | o- nvme1n1 ............................................................ [/dev/nvme1n1, in use]
  | | o- nvme2n1 ............................................................ [/dev/nvme2n1, in use]
  | o- pscsi .................................................................... [0 Storage Object]
  | o- rd_mcp ................................................................... [1 Storage Object]
  |   o- ramdisk ...................................................... [16.0G, ramdisk, not in use]
  o- qla2xxx ........................................................................... [2 Targets]
  | o- 21:00:00:24:ff:48:97:7e ........................................................... [enabled]
  | | o- acls .............................................................................. [1 ACL]
  | | | o- 21:00:00:24:ff:48:97:7c ................................................. [3 Mapped LUNs]
  | | |   o- mapped_lun0 ............................................................... [lun0 (rw)]
  | | |   o- mapped_lun1 ............................................................... [lun1 (rw)]
  | | |   o- mapped_lun2 ............................................................... [lun2 (rw)]
  | | o- luns ............................................................................. [3 LUNs]
  | |   o- lun0 .................................................... [iblock/nvme0n1 (/dev/nvme0n1)]
  | |   o- lun1 .................................................... [iblock/nvme1n1 (/dev/nvme1n1)]
  | |   o- lun2 .................................................... [iblock/nvme2n1 (/dev/nvme2n1)]
  | o- 21:00:00:24:ff:48:97:7f ........................................................... [enabled]
  |   o- acls .............................................................................. [1 ACL]
  |   | o- 21:00:00:24:ff:48:97:7d ................................................. [3 Mapped LUNs]
  |   |   o- mapped_lun0 ............................................................... [lun0 (rw)]
  |   |   o- mapped_lun1 ............................................................... [lun1 (rw)]
  |   |   o- mapped_lun2 ............................................................... [lun2 (rw)]
  |   o- luns ............................................................................. [3 LUNs]
  |     o- lun0 .................................................... [iblock/nvme0n1 (/dev/nvme0n1)]
  |     o- lun1 .................................................... [iblock/nvme1n1 (/dev/nvme1n1)]
  |     o- lun2 .................................................... [iblock/nvme2n1 (/dev/nvme2n1)]

Attached is the fio write-verify workload for reference.

Also, a few changes made to your test script:

- Use sg_reset -H (-h is help :) for the host reset op
- Sleep 10 seconds between calls instead of 2 minutes
- Limit the sg_reset SCSI device list to the 3x remote-ports

The last one is to verify with various sg_reset ops across remote-ports
only, separate from any existing active I/O session disconnect bugs that
may exist beyond this specific PATCH-v4 series.

To that end, after verifying tonight with 100x iterations of your script
with the above changes, fio write-verify is still functioning as expected
with remote-port-only sg_reset ops using NVMe/IBLOCK backends.

So that said, I'll be pushing what's in target-pending/master as -rc4
code, and continue to debug the hung task as a separate active I/O
shutdown related issue.

Thanks again for your help,

--nab
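As a quick side note on the sense buffer quoted above: it is plain SPC
fixed-format sense data, so the key/ASC/ASCQ values can be read straight
out of the bytes. A minimal stand-alone sketch of that decode, using only
the standard fixed-format offsets (byte 2 low nibble = sense key,
bytes 12/13 = ASC/ASCQ):

  /*
   * Decode the fixed-format sense buffer from the crash dump above.
   * Offsets per SPC fixed-format sense data: byte 0 = response code,
   * byte 2 bits 3:0 = sense key, byte 12 = ASC, byte 13 = ASCQ.
   */
  #include <stdio.h>

  int main(void)
  {
          /* Bytes quoted from the dump above */
          unsigned char sense[] = {
                  0x70, 0x00, 0x0b, 0x00, 0x00, 0x00, 0x00, 0x0a,
                  0x00, 0x00, 0x00, 0x00, 0x29, 0x03, 0x00,
          };

          printf("response code: 0x%02x\n", sense[0] & 0x7f); /* 0x70 = fixed, current */
          printf("sense key:     0x%02x\n", sense[2] & 0x0f); /* 0x0b = ABORTED COMMAND */
          printf("asc/ascq:      0x%02x/0x%02x\n", sense[12], sense[13]); /* 0x29/0x03 */
          return 0;
  }

Running this prints key 0x0b / ASC 0x29 / ASCQ 0x03, which matches the
TCM_CHECK_CONDITION_ABORT_CMD entry from sense_info_table[] quoted above.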
[global]
thread=1
blocksize_range=4k-256k
direct=1
ioengine=libaio
verify=crc32c-intel
verify_interval=512
iodepth=32
size=1000G
loops=100
numjobs=1
invalidate=0
filename=/dev/sdb
filename=/dev/sdc
filename=/dev/sdd

[verify]
rw=randrw
do_verify=1
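For reference, the device/bus/host reset ops that sg_reset issues (and the
-H vs. -h gotcha noted above) boil down to the SG_SCSI_RESET ioctl from
<scsi/sg.h>. A minimal sketch of the host-reset case; "/dev/sdb" is purely
a placeholder for one of the remote-port SCSI device nodes in the test:

  /*
   * Issue a SCSI host reset via the SG_SCSI_RESET ioctl, roughly what
   * "sg_reset -H <dev>" does.  Use SG_SCSI_RESET_DEVICE or
   * SG_SCSI_RESET_BUS instead for the other reset scopes.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>

  int main(void)
  {
          int op = SG_SCSI_RESET_HOST;
          int fd = open("/dev/sdb", O_RDWR | O_NONBLOCK); /* placeholder device */

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          if (ioctl(fd, SG_SCSI_RESET, &op) < 0)
                  perror("SG_SCSI_RESET");
          close(fd);
          return 0;
  }

In the modified test script this is simply driven in a loop over the 3x
remote-port devices with a 10-second sleep between calls.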