On Sat, 2014-05-31 at 13:03 +0300, Charalampos Pournaris wrote:
> On Sun, May 25, 2014 at 9:42 AM, Thomas Glanzmann <thomas@xxxxxxxxxxxx> wrote:
> > Hello Nab,
> >
> >> Usually spreading out the LUNs across multiple endpoints (eg: multiple
> >> IQNs) helps to reduce or completely eliminate the ESX host side issues.
> >> Typically you want to keep it around 4-5 LUNs per endpoint on 1 Gb/sec
> >> ports.
> >
> > In my case I had 3 LUNs per endpoint. With rts/os with the same config
> > it worked. But as soon as my time permits it, I'll hunt it down.
> >
> > Cheers,
> > Thomas
>
> Hi again,
>
> I managed to enable dynamic debugging on the Debian VMs as Thomas
> suggested, and I've got some logs from when the problem manifested itself
> again. More specifically, I started seeing datastores losing their
> connection to the attached hosts, and at some point the VM that hosts
> the iSCSI target crashed with a kernel panic.
>
> I've got an enormous amount of logs, but I'm not quite sure which ones
> would be more helpful to you: kern.log, debug, syslog, messages, or all of
> them? All of them gzipped have a size of around 700 MB (unzipped, each
> one is around 2 GB). Would you be able to have a look at such big
> files? (I guess by targeting some specific output.)
>
> By the way, I had also created multiple IQNs (one for each host
> accessing the LUN) as per Nicholas' suggestion (if I got it right). It
> can be seen below:
>
> <SNIP>

The configuration looks fine. Thanks for including the extra info..

> If for some reason the formatting is not displayed properly, or for
> better readability, check this screenshot:
> http://postimg.org/image/e358ov40r/full/
>
> Thank you in advance for your help.

Ok, so looking at these logs, it's apparent that there are significantly
fewer occurrences of ABORT_TASK. In fact, AFAICT there is only a single
occurrence of ABORT_TASK in the entire log.

This could be attributed to the reconfiguration to use a single LUN per
endpoint, to avoid the false-positive timeout issues that ESX is known to
generate with multiple LUNs per TargetName+TargetPortalGroupTag endpoint..

However, looking at that single instance of ABORT_TASK in the log,
something else appears to be happening with your backend:

May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032235] Got Task Management Request ITT: 0x0027a82c, CmdSN: 0x3da72700, Function: 0x01, RefTaskTag: 0x0027a814, RefCmdSN: 0x25a72700, CID: 0
May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032266] ABORT_TASK: Found referenced iSCSI task_tag: 2598932
May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032271] wait_for_tasks: Stopping ffff8800bac15810 ITT: 0x0027a814 i_state: 6, t_state: 5, CMD_T_STOP

The most interesting line is the last one, wrt wait_for_tasks..

Decoded, those i_state and t_state values mean:

i_state: 6 (ISTATE_RECEIVED_LAST_DATAOUT)
t_state: 5 (TRANSPORT_PROCESSING)

The 'TRANSPORT_PROCESSING' t_state means that an I/O request was
dispatched to the backend (iblock/24378_iscsi), but the underlying
storage never completed the outstanding I/O back to the target layer.
Or at least, this occurrence of ABORT_TASK is right near the end of the
logs, and there is no debug output to indicate that the I/O completion
ever occurs.

This usually means some type of problem with the underlying driver for
the backend storage, as there is no legitimate reason why outstanding
I/Os would not (eventually) be completed back to IBLOCK, be it with GOOD
or some manner of exception status.
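Btw, re: the 2 GB logs, there's no need to post the whole thing. A quick
filter along the lines of the rough Python sketch below would cut
kern.log/syslog down to just the target-related lines worth looking at.
(This is only a sketch; the ITT/tag patterns are the values from this
particular trace and would need adjusting for other occurrences.)

#!/usr/bin/env python3
# Rough sketch: pull target-related lines (ABORT_TASK, Task Management,
# wait_for_tasks, plus the specific ITTs from the trace above) out of a
# large kern.log/syslog so only the relevant slice needs to be shared.
import re
import sys

# The ITT values here come from this particular trace; adjust as needed.
PATTERNS = re.compile(
    r"ABORT_TASK|Task Management Request|wait_for_tasks|"
    r"0x0027a814|0x0027a82c"
)

def main(path):
    with open(path, errors="replace") as log:
        for line in log:
            if PATTERNS.search(line):
                sys.stdout.write(line)

if __name__ == "__main__":
    main(sys.argv[1])

Running that over kern.log and posting just the output (plus a bit of
surrounding context around the abort) would be plenty.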
So that said, I would start investigating the underlying LLD for
iblock/24378_iscsi (/dev/sdb)..

What type of storage + LLD is it using..?

Is the HBA using the latest available firmware..?

Is there anything else special about this backend..?

--nab
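P.S. For the LLD question, a rough sysfs walk like the sketch below
(standard /sys/block and /sys/class/scsi_host paths; the /dev/sdb default
and the attribute list are just assumptions for this setup, and firmware
details beyond proc_name are driver-specific) reports the SCSI inquiry
data and the low-level driver name behind the backend device:

#!/usr/bin/env python3
# Rough sketch: report what is driving a block device such as /dev/sdb by
# reading standard sysfs attributes: vendor/model/rev on the scsi_device,
# and proc_name (the LLD name) on the owning scsi_host. Firmware details
# beyond this are driver-specific and usually come from vendor tools.
import os
import sys

def read_attr(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def describe(dev="sdb"):
    devpath = os.path.realpath("/sys/block/%s/device" % dev)
    print("sysfs device path:", devpath)

    # Standard SCSI INQUIRY data exposed for every scsi_device.
    for attr in ("vendor", "model", "rev"):
        val = read_attr(os.path.join(devpath, attr))
        if val is not None:
            print("%s: %s" % (attr, val))

    # The hostN component in the path identifies the SCSI host; proc_name
    # on the scsi_host is the name of the low-level driver behind it.
    host = next((p for p in devpath.split(os.sep) if p.startswith("host")), None)
    if host:
        print("%s LLD (proc_name): %s" %
              (host, read_attr("/sys/class/scsi_host/%s/proc_name" % host)))

if __name__ == "__main__":
    describe(sys.argv[1] if len(sys.argv) > 1 else "sdb")

That, plus the dmesg lines from when /dev/sdb was probed at boot, should
cover the first two questions.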