On Sat, 2014-05-31 at 13:03 +0300, Charalampos Pournaris wrote:
> On Sun, May 25, 2014 at 9:42 AM, Thomas Glanzmann <thomas@xxxxxxxxxxxx> wrote:
> > Hello Nab,
> >
> >> Usually spreading out the LUNs across multiple endpoints (eg: multiple
> >> IQNs) helps to reduce or completely eliminate the ESX host side issues.
> >> Typically you want to keep it around 4-5 LUNs per endpoint on 1 Gb/sec
> >> ports.
> >
> > In my case I had 3 LUNs per endpoint. With rts/os with the same config
> > it worked. But as soon as my time permits it, I'll hunt it down.
> >
> > Cheers,
> > Thomas
>
> Hi again,
>
> I managed to enable dynamic debugging on the Debian VMs as Thomas
> suggested, and I've got some logs from when the problem manifested itself
> again. More specifically, I started seeing datastores losing their
> connection to the attached hosts, and at some point the VM that hosts
> the iSCSI target crashed with a kernel panic.
>
> I've got an enormous amount of logs, but I'm not quite sure which ones
> would be more helpful to you: kern.log, debug, syslog, messages, or all of
> them? All of them gzipped have a size of around 700 MB (unzipped, each
> one is around 2 GB). Would you be able to have a look at such big
> files? (I guess by targeting some specific output.)
>
> By the way, I had also created multiple IQNs (one for each host
> accessing the LUN) as per Nicholas' suggestion (if I got it right). It
> can be seen below:
>
> <SNIP>

The configuration looks fine. Thanks for including the extra info..

> If for some reason the formatting is not displayed properly, or for
> better readability, check this screenshot:
> http://postimg.org/image/e358ov40r/full/
>
> Thank you in advance for your help.

Ok, so looking at these logs, it's apparent that there are significantly
fewer occurrences of ABORT_TASK. In fact, AFAICT there is only a single
occurrence of ABORT_TASK in the entire log.

This could be attributed to the reconfiguration to use a single LUN per
endpoint, to avoid the false-positive timeout issues that ESX is known to
generate with multiple LUNs per TargetName+TargetPortalGroupTag endpoint..

However, looking at that single instance of ABORT_TASK in the log,
something else appears to be happening with your backend:

May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032235] Got Task Management Request ITT: 0x0027a82c, CmdSN: 0x3da72700, Function: 0x01, RefTaskTag: 0x0027a814, RefCmdSN: 0x25a72700, CID: 0
May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032266] ABORT_TASK: Found referenced iSCSI task_tag: 2598932
May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032271] wait_for_tasks: Stopping ffff8800bac15810 ITT: 0x0027a814 i_state: 6, t_state: 5, CMD_T_STOP

The most interesting line is the last one, wrt wait_for_tasks..

Decoded, those i_state and t_state values mean:

i_state: 6 (ISTATE_RECEIVED_LAST_DATAOUT)
t_state: 5 (TRANSPORT_PROCESSING)

The 'TRANSPORT_PROCESSING' t_state means that an I/O request was
dispatched to the backend (iblock/24378_iscsi), but the underlying
storage never completed the outstanding I/O back to the target layer.
Or at least, this occurrence of ABORT_TASK is right near the end of the
logs, and there is no debug output to indicate that the I/O completion
ever occurs.

This usually means some type of problem with the underlying driver for
the backend storage, as there is no legitimate reason why outstanding
I/Os would not (eventually) be completed back to IBLOCK, be it with GOOD
or some manner of exception status.
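Btw, re: the 2 GB logs, there's no need to post the whole thing. A quick
filter along the lines of the rough Python sketch below would cut
kern.log/syslog down to just the target-related lines worth looking at.
(This is only a sketch; the ITT/tag patterns are the values from this
particular trace and would need adjusting for other occurrences.)

#!/usr/bin/env python3
# Rough sketch: pull target-related lines (ABORT_TASK, Task Management,
# wait_for_tasks, plus the specific ITTs from the trace above) out of a
# large kern.log/syslog so only the relevant slice needs to be shared.
import re
import sys

# The ITT values here come from this particular trace; adjust as needed.
PATTERNS = re.compile(
    r"ABORT_TASK|Task Management Request|wait_for_tasks|"
    r"0x0027a814|0x0027a82c"
)

def main(path):
    with open(path, errors="replace") as log:
        for line in log:
            if PATTERNS.search(line):
                sys.stdout.write(line)

if __name__ == "__main__":
    main(sys.argv[1])

Running that over kern.log and posting just the output (plus a bit of
surrounding context around the abort) would be plenty.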
So that said, I would start investigating the underlying LLD for
iblock/24378_iscsi (/dev/sdb)..

What type of storage + LLD is it using..?

Is the HBA using the latest available firmware..?

Is there anything else special about this backend..?

--nab
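P.S. For the LLD question, a rough sysfs walk like the sketch below
(standard /sys/block and /sys/class/scsi_host paths; the /dev/sdb default
and the attribute list are just assumptions for this setup, and firmware
details beyond proc_name are driver-specific) reports the SCSI inquiry
data and the low-level driver name behind the backend device:

#!/usr/bin/env python3
# Rough sketch: report what is driving a block device such as /dev/sdb by
# reading standard sysfs attributes: vendor/model/rev on the scsi_device,
# and proc_name (the LLD name) on the owning scsi_host. Firmware details
# beyond this are driver-specific and usually come from vendor tools.
import os
import sys

def read_attr(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def describe(dev="sdb"):
    devpath = os.path.realpath("/sys/block/%s/device" % dev)
    print("sysfs device path:", devpath)

    # Standard SCSI INQUIRY data exposed for every scsi_device.
    for attr in ("vendor", "model", "rev"):
        val = read_attr(os.path.join(devpath, attr))
        if val is not None:
            print("%s: %s" % (attr, val))

    # The hostN component in the path identifies the SCSI host; proc_name
    # on the scsi_host is the name of the low-level driver behind it.
    host = next((p for p in devpath.split(os.sep) if p.startswith("host")), None)
    if host:
        print("%s LLD (proc_name): %s" %
              (host, read_attr("/sys/class/scsi_host/%s/proc_name" % host)))

if __name__ == "__main__":
    describe(sys.argv[1] if len(sys.argv) > 1 else "sdb")

That, plus the dmesg lines from when /dev/sdb was probed at boot, should
cover the first two questions.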