On Tue, Jun 3, 2014 at 10:37 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote:
> On Sat, 2014-05-31 at 13:03 +0300, Charalampos Pournaris wrote:
>> On Sun, May 25, 2014 at 9:42 AM, Thomas Glanzmann <thomas@xxxxxxxxxxxx> wrote:
>> > Hello Nab,
>> >
>> >> Usually spreading out the LUNs across multiple endpoints (eg: multiple
>> >> IQNs) helps to reduce or completely eliminate the ESX host side issues.
>> >> Typically you want to keep it around 4-5 LUNs per endpoint on 1 Gb/sec
>> >> ports.
>> >
>> > In my case I had 3 LUNs per endpoint. With rts/os with the same config
>> > it worked. But as soon as my time permits, I'll hunt it down.
>> >
>> > Cheers,
>> > Thomas
>>
>> Hi again,
>>
>> I managed to enable dynamic debugging on the Debian VMs as Thomas
>> suggested, and I've captured some logs from when the problem manifested
>> itself again. More specifically, I started seeing datastores losing
>> their connection to the attached hosts, and at some point the VM that
>> hosts the iSCSI target crashed with a kernel panic.
>>
>> I've got an enormous amount of logs, but I'm not quite sure which ones
>> would be most helpful to you: kern.log, debug, syslog, messages, or all
>> of them? Gzipped, all of them together are around 700 MB (unzipped,
>> each one is around 2 GB). Would you be able to have a look at such big
>> files (I guess by targeting some specific output)?
>>
>> By the way, I had also created multiple IQNs (one for each host
>> accessing the LUN) as per Nicholas' suggestion (if I got it right). It
>> can be seen below:
>>
>
> <SNIP>
>
> The configuration looks fine. Thanks for including the extra info.
>
>> If for some reason the formatting is not displayed properly, or for
>> better readability, check this screenshot:
>> http://postimg.org/image/e358ov40r/full/
>>
>> Thank you in advance for your help.
>>
>
> Ok, so looking at these logs, it's apparent that there are significantly
> fewer occurrences of ABORT_TASK. In fact, AFAICT there is only a single
> occurrence of ABORT_TASK in the entire log.
>
> This could be attributed to the reconfiguration to use a single LUN per
> endpoint, to avoid the false-positive timeout issues that ESX is known
> to generate with multiple LUNs per TargetName+TargetPortalGroupTag
> endpoint.
>
> However, looking at the single instance of ABORT_TASK in the log,
> something else appears to be happening with your backend:
>
> May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032235] Got Task Management Request ITT: 0x0027a82c, CmdSN: 0x3da72700, Function: 0x01, RefTaskTag: 0x0027a814, RefCmdSN: 0x25a72700, CID: 0
> May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032266] ABORT_TASK: Found referenced iSCSI task_tag: 2598932
> May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032271] wait_for_tasks: Stopping ffff8800bac15810 ITT: 0x0027a814 i_state: 6, t_state: 5, CMD_T_STOP
>
> The most interesting line is the last one, wrt wait_for_tasks.
>
> Decoded, these i_state and t_state values mean:
>
> i_state: 6 (ISTATE_RECEIVED_LAST_DATAOUT)
> t_state: 5 (TRANSPORT_PROCESSING)
>
> The significance of the 'TRANSPORT_PROCESSING' t_state is that an I/O
> request was dispatched to the backend (iblock/24378_iscsi), but the
> underlying storage never completed the outstanding I/O back to the
> target layer. Or at least, this occurrence of ABORT_TASK is right near
> the end of the logs, and there is no debug output to indicate that the
> I/O completion ever occurs.
>
> This usually means some type of problem with the underlying driver for
> the backend storage, as there is no legitimate reason why outstanding
> I/Os would not (eventually) be completed back to IBLOCK, be it with
> GOOD or some manner of exception status.
>
> So that said, I would start investigating the underlying LLD driver for
> iblock/24378_iscsi (/dev/sdb). What type of storage + LLD is it
> using? Is the HBA using the latest available firmware? Is there
> anything else special about this backend?
>
> --nab

Hi Nicholas,

First of all, thanks for the detailed explanation. It seems that this
last problem we've hit is different from the one reported initially,
since it caused a kernel panic, whereas the other issue makes the
datastore(s) inactive, constantly throwing a login error.

By the way, I'm using Linux VMs (Debian) to expose the iSCSI datastores
(i.e. /dev/sdb is a local drive), so there is no special
hardware/firmware involved. The only possible culprit I can think of
for the incomplete I/O is the vmware-tools driver, or perhaps some
kernel driver?
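To answer the LLD question as concretely as I can, here is roughly what
I'm checking inside the guest to identify the driver behind /dev/sdb.
This is just a sketch using standard sysfs paths; the driver names in
the comments and the host0 index are my assumptions for a VMware guest,
so the actual output may differ:

--------------------------------------------------------------------------------------------------------------
# Which LLD owns each SCSI host? In a VMware guest this is typically
# mptspi/mptsas (LSI Logic) or vmw_pvscsi, but check the real output.
cat /sys/class/scsi_host/host*/proc_name

# Vendor/model/revision that /dev/sdb reports:
cat /sys/block/sdb/device/vendor \
    /sys/block/sdb/device/model \
    /sys/block/sdb/device/rev

# Module details for the LLD (assuming it sits on host0):
modinfo "$(cat /sys/class/scsi_host/host0/proc_name)" | head
--------------------------------------------------------------------------------------------------------------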
I've hit the initially reported issue again and have fresh logs to
share; I'll send a separate mail with a link to them. Additionally,
when I attempted to stop the target service, it got stuck:

--------------------------------------------------------------------------------------------------------------
root@sof-24379-iscsi-vm:/var/log# /etc/init.d/target stop
Unloading fabric/configfs:
Successfully released fabric: /sys/kernel/config/target/srpt
Successfully released fabric: /sys/kernel/config/target/qla2xxx
Successfully released fabric: /sys/kernel/config/target/loopback
Successfully released fabric: /sys/kernel/config/target/fc
 [OK]
Unloading LIO-Target/ConfigFS fabric:
Successfully released network portal: 10.23.84.25:3260 created iqn.2003-01.org.linux-iscsi.sof-24379-iscsi-vm.x8664:sn.6d8c93856c59 TPGT: 1
Successfully deleted iSCSI Initiator Mapped LUN: 0 ACL iqn.1998-01.com.vmware:sof-24378-srv-4e00cca4 for iSCSI Target Portal Group: iqn.2003-01.org.linux-iscsi.sof-24379-iscsi-vm.x8664:sn.6d8c93856c59 1
Successfully deleted iSCSI Initaitor ACL iqn.1998-01.com.vmware:sof-24378-srv-4e00cca4 for iSCSI Target Portal Group: iqn.2003-01.org.linux-iscsi.sof-24379-iscsi-vm.x8664:sn.6d8c93856c59 1
Successfully deleted iSCSI Target Logical Unit
Successfully released iSCSI Target Portal Group: iqn.2003-01.org.linux-iscsi.sof-24379-iscsi-vm.x8664:sn.6d8c93856c59 TPGT: 1
Successfully released iSCSI Target Endpoint IQN: iqn.2003-01.org.linux-iscsi.sof-24379-iscsi-vm.x8664:sn.6d8c93856c59
<stuck at this point>
--------------------------------------------------------------------------------------------------------------
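Next time the shutdown wedges like this, I'll try to capture where it
is blocked before rebooting. A rough sketch of what I plan to run,
assuming the Debian kernel has CONFIG_MAGIC_SYSRQ and CONFIG_STACKTRACE
enabled (the pgrep pattern is a guess at the process name):

--------------------------------------------------------------------------------------------------------------
# Enable sysrq, then dump every blocked (D-state) task with its kernel
# stack into the kernel log:
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Kernel stack of the stuck stop script itself:
PID=$(pgrep -f '/etc/init.d/target stop')
cat /proc/$PID/stack
--------------------------------------------------------------------------------------------------------------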
Below you can see some of the ABORT_TASK messages in these new logs:

--------------------------------------------------------------------------------------------------------------
....
Jun 10 11:15:40 sof-24379-iscsi-vm kernel: [843336.907166] ABORT_TASK: Found referenced iSCSI task_tag: 7383010
Jun 10 11:15:40 sof-24379-iscsi-vm kernel: [843337.045742] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7383010
Jun 10 11:16:09 sof-24379-iscsi-vm kernel: [843366.085476] ABORT_TASK: Found referenced iSCSI task_tag: 14376053
Jun 10 11:16:09 sof-24379-iscsi-vm kernel: [843366.114603] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 14376053
Jun 10 11:16:09 sof-24379-iscsi-vm kernel: [843366.114637] ABORT_TASK: Found referenced iSCSI task_tag: 14376065
Jun 10 11:16:09 sof-24379-iscsi-vm kernel: [843366.114640] ABORT_TASK: ref_tag: 14376065 already complete, skipping
Jun 10 11:16:09 sof-24379-iscsi-vm kernel: [843366.114641] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 14376065
Jun 10 11:16:10 sof-24379-iscsi-vm kernel: [843367.116890] ABORT_TASK: Found referenced iSCSI task_tag: 18861488
Jun 10 11:16:10 sof-24379-iscsi-vm kernel: [843367.116892] ABORT_TASK: ref_tag: 18861488 already complete, skipping
Jun 10 11:16:10 sof-24379-iscsi-vm kernel: [843367.116895] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 18861488
Jun 10 11:16:10 sof-24379-iscsi-vm kernel: [843367.116903] ABORT_TASK: Found referenced iSCSI task_tag: 18861490
Jun 10 11:16:10 sof-24379-iscsi-vm kernel: [843367.122593] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 18861490
Jun 10 11:16:13 sof-24379-iscsi-vm kernel: [843369.669737] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7383497
Jun 10 11:16:13 sof-24379-iscsi-vm kernel: [843369.669747] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7383481
Jun 10 11:16:13 sof-24379-iscsi-vm kernel: [843369.669752] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7383486
Jun 10 11:17:12 sof-24379-iscsi-vm kernel: [843429.355248] ABORT_TASK: Found referenced iSCSI task_tag: 7383515
Jun 10 11:17:12 sof-24379-iscsi-vm kernel: [843429.355255] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7383515
Jun 10 11:17:12 sof-24379-iscsi-vm kernel: [843429.355259] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7383515
Jun 10 11:17:12 sof-24379-iscsi-vm kernel: [843429.355261] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7383515
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.298874] ABORT_TASK: Found referenced iSCSI task_tag: 18890483
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.298883] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 18890483
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.298888] ABORT_TASK: Found referenced iSCSI task_tag: 18890486
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347128] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 18890486
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347134] ABORT_TASK: Found referenced iSCSI task_tag: 18890491
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347136] ABORT_TASK: ref_tag: 18890491 already complete, skipping
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347138] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 18890491
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347141] ABORT_TASK: Found referenced iSCSI task_tag: 18890496
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347142] ABORT_TASK: ref_tag: 18890496 already complete, skipping
Jun 10 11:26:43 sof-24379-iscsi-vm kernel: [844000.347144] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 18890496
...
--------------------------------------------------------------------------------------------------------------
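There are many more of these in the full file. In case the totals are
useful, this is how I've been summarising them; it simply assumes the
message format stays exactly as above:

--------------------------------------------------------------------------------------------------------------
# Count each TMR response type across the whole log:
grep -o 'ABORT_TASK: Sending TMR_[A-Z_]*' kern.log | sort | uniq -c

# ref_tags that were still outstanding when the abort arrived:
grep 'Sending TMR_FUNCTION_COMPLETE for ref_tag:' kern.log |
    awk '{print $NF}' | sort -n | uniq -c
--------------------------------------------------------------------------------------------------------------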
Greetings,
Harry