Hello Nab,

> If / when your able to reproduce, please make sure to enable the
> dynamic debugging for iscsi_target_mod after it's triggered to see
> what's going on..

I see. Yesterday I enabled the debugging before trying to trigger the
incident. I had 1G of syslog output.

> Also as mentioned earlier, the original logs indicate that the target
> was explicitly shutdown + modules unloaded (and not restarted) almost
> immediately after the ABORT_TASKs where received, and no other errors
> / exceptions where reported. You where able to confirm that shutdown
> and non restart was expected, right..?

I checked the logs. The first log message I can see is here:

Oct 11 11:53:56 node-62 kernel: [219465.151250] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5488

I started the evaluation at this exact time (I always download the
evaluation slides from my webserver directly before starting the
evaluation, because they contain a password for the online evaluation of
the class which is only valid for 60 minutes).

176.94.62.170 - - [11/Oct/2013:12:06:16 +0200] "GET /xxx/xxx.pdf HTTP/1.1" 200 43168 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36" "thomas.glanzmann.de"

Between 12:06:16 and 12:14:51 the participants must have complained
about non-responding ESX servers, so I checked the serial console and
the output.
And I remember seeing:

ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5488

and

TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000001d

I wanted to have the target back as soon as possible, so I restarted it
at this exact time:

Oct 11 12:14:39 node-62 shutdown[9433]: shutting down for system reboot

Logs from the switch show that the ports went down 12 seconds after I
typed in 'reboot':

(infra) [/var/adm/syslog/2013/10/11] grep 'procurve-04 ports: port 23' local6
Oct 11 12:14:51 procurve-04 ports: port 23 in Trk10 is now off-line
Oct 11 12:14:53 procurve-04 ports: port 23 is Blocked by LACP
Oct 11 12:14:59 procurve-04 ports: port 23 in Trk10 is now on-line
Oct 11 12:15:03 procurve-04 ports: port 23 in Trk10 is now off-line
Oct 11 12:15:07 procurve-04 ports: port 23 is Blocked by LACP
Oct 11 12:15:44 procurve-04 ports: port 23 in Trk10 is now off-line
Oct 11 12:15:49 procurve-04 ports: port 23 is Blocked by LACP
Oct 11 12:15:50 procurve-04 ports: port 23 in Trk10 is now on-line

What bothers me is that I see not only TMR_TASK_DOES_NOT_EXIST but also
'Detected NON_EXISTENT_LUN Access', so to me it looks like the target
forgot about the LUNs it had configured _without_ me doing anything.

So the target was not restarted immediately but was sitting in this
state for around 20 minutes, from 11:53:56 till 12:14:39. But I think it
was fail-operational, because if you lose access to all paths of a LUN
(APD, All Paths Down) while working with an ESX server, you notice
immediately: tasks no longer make any progress and everything becomes
sluggish (no response to commands given). Participants would have
complained to me earlier.

We had configured multipathing in round robin, which means we had 12
targets; 5 devices; 20 paths (4 per device over 2 portals using two
initiators). See also PDF page 28 labeled 'iscsi multipathing' for the
setup.

https://thomas.glanzmann.de/tmp/whiteboard.pdf

The two demo mode LUNs were on one target each.
The 3 LUNs private to the individual ESX servers were together on one
target.

After the participants reported to me that they had a problem, I checked
dmesg and the serial console of the target, saw the ABORT_TASKs and
'NON_EXISTENT_LUN Access', and typed in 'reboot'. After the reboot
everything was back to normal. However, we left for lunch break and only
started working with the systems one hour later. By that point all the
ongoing tasks in vCenter had timed out and we were able to continue
working.

Cheers,
        Thomas
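
For reference, the two intervals quoted in this mail (about 20 minutes in
the degraded state, and 12 seconds between 'reboot' and the first switch
port event) can be re-checked from the log timestamps. The following is
only a small sketch: the event names and the helper seconds_between() are
made up for illustration, and the timestamps are copied by hand from the
log lines above, assuming the target and the switch logs use the same
local clock.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this mail (2013-10-11).
# The dictionary keys are illustrative labels, not names from any tool.
EVENTS = {
    "first_abort_task": "2013-10-11 11:53:56",     # first ABORT_TASK in kernel log
    "reboot_issued": "2013-10-11 12:14:39",        # shutdown[9433] message
    "switch_port_offline": "2013-10-11 12:14:51",  # procurve-04 port 23 off-line
}

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start, end):
    """Return the number of whole seconds from event `start` to event `end`."""
    t0 = datetime.strptime(EVENTS[start], FMT)
    t1 = datetime.strptime(EVENTS[end], FMT)
    return int((t1 - t0).total_seconds())


degraded = seconds_between("first_abort_task", "reboot_issued")
port_delay = seconds_between("reboot_issued", "switch_port_offline")

print(f"first ABORT_TASK to reboot: {degraded // 60} min {degraded % 60} s")
print(f"reboot to first port off-line: {port_delay} s")
```

This gives 20 min 43 s between the first ABORT_TASK and the reboot, and
12 s between the reboot and the first port off-line message.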