On Tue, 2014-06-10 at 13:17 +0300, Charalampos Pournaris wrote:
> On Tue, Jun 3, 2014 at 10:37 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Sat, 2014-05-31 at 13:03 +0300, Charalampos Pournaris wrote:
> >> On Sun, May 25, 2014 at 9:42 AM, Thomas Glanzmann <thomas@xxxxxxxxxxxx> wrote:

<SNIP>

> > The configuration looks fine. Thanks for including the extra info..
> >
> >> If for some reason the formatting is not displayed properly, or for
> >> better readability, check this screenshot:
> >> http://postimg.org/image/e358ov40r/full/
> >>
> >> Thank you in advance for your help.
> >
> > Ok, so looking at these logs it's apparent that there are significantly
> > fewer occurrences of ABORT_TASK. In fact, AFAICT there is only a single
> > occurrence of ABORT_TASK in the entire log.
> >
> > This could be attributed to the reconfiguration to use a single LUN per
> > endpoint, to avoid the false-positive timeout issues that ESX is known
> > to generate with multiple LUNs per TargetName+TargetPortalGroupTag
> > endpoint..
> >
> > However, looking at the single instance of ABORT_TASK in the log,
> > something else appears to be happening with your backend:
> >
> > May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032235] Got Task Management Request ITT: 0x0027a82c, CmdSN: 0x3da72700, Function: 0x01, RefTaskTag: 0x0027a814, RefCmdSN: 0x25a72700, CID: 0
> > May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032266] ABORT_TASK: Found referenced iSCSI task_tag: 2598932
> > May 30 08:54:58 sof-24378-iscsi-vm kernel: [105260.032271] wait_for_tasks: Stopping ffff8800bac15810 ITT: 0x0027a814 i_state: 6, t_state: 5, CMD_T_STOP
> >
> > The most interesting line is the last one, wrt wait_for_tasks..
> >
> > Decoded, these i_state and t_state values mean:
> >
> > i_state: 6 (ISTATE_RECEIVED_LAST_DATAOUT)
> > t_state: 5 (TRANSPORT_PROCESSING)
> >
> > The significance of the 'TRANSPORT_PROCESSING' t_state is that an I/O
> > request was dispatched to the backend (iblock/24378_iscsi), but the
> > underlying storage never completes the outstanding I/O back to the
> > target layer. Or at least, this occurrence of ABORT_TASK is right near
> > the end of the logs, and there is no debug output to indicate that the
> > I/O completion ever occurs.
> >
> > This usually means some type of problem with the underlying driver for
> > the backend storage, as there is no legitimate reason why outstanding
> > I/Os would not (eventually) be completed back to IBLOCK, be it with
> > GOOD or some manner of exception status.
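> >
> > For reference, the wait that the log line above reflects lives in
> > transport_wait_for_tasks(). A simplified sketch of the shape of that
> > path (the exact code differs between kernel versions):
> >
> >     unsigned long flags;
> >
> >     /* Mark the command so the completion path knows a stop is
> >      * pending; this is the CMD_T_STOP bit shown in the log line
> >      * above.
> >      */
> >     spin_lock_irqsave(&cmd->t_state_lock, flags);
> >     cmd->transport_state |= CMD_T_STOP;
> >     spin_unlock_irqrestore(&cmd->t_state_lock, flags);
> >
> >     /* Block until the backend completes the I/O back to the target
> >      * core.  If the LLD never completes it, this waits forever.
> >      */
> >     wait_for_completion(&cmd->t_transport_stop_comp);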
> >
> > So that said, I would start investigating the underlying LLD driver for
> > iblock/24378_iscsi (/dev/sdb).. What type of storage + LLD is it
> > using..? Is the HBA using the latest available firmware..? Is there
> > anything else special about this backend..?
> >
> > --nab
>
> Hi Nicholas,
>
> First of all, thanks for the detailed explanation. It seems that this
> last problem we've hit is different from the one reported initially,
> since it caused a kernel panic, whereas the other issue makes the
> datastore(s) inactive, constantly throwing a login error. By the way,
> I'm using Linux VMs (Debian) to expose the iSCSI datastores (i.e.
> /dev/sdb is a local drive), so no special hardware/firmware is
> involved. The only possible issue I can think of is that the
> vmware-tools driver might be the culprit for the incomplete I/O, or
> perhaps some kernel driver?
>
> I've hit the initially reported issue again and have fresh logs to
> share. I'll send a separate mail with a link to the new logs.
>
> Additionally, when I attempted to stop the target service, it got stuck:

Ok, can you please apply the following three patches to your setup..?

target: Set CMD_T_ACTIVE bit for Task Management Requests
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=f15e9cd910c4d9da7de43f2181f362082fc45f0f

target: Use complete_all for se_cmd->t_transport_stop_comp
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=a95d6511303b848da45ee27b35018bb58087bdc6

iscsi-target: Fix ABORT_TASK + connection reset iscsi_queue_req memory leak
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=bbc050488525e1ab1194c27355f63c66814385b8

These address the bug where a backend I/O takes a long time (say, over
120 seconds) to process, causing an ABORT_TASK + iSCSI session reset to
occur before the backend I/O completes. A more detailed explanation is
here:

http://permalink.gmane.org/gmane.linux.scsi.target.devel/6489

Note, however, that this addresses the case where the backend I/O takes
a long time to complete but still *needs* to complete at some point.
What I'm not sure about at this point is whether your backend is just
taking an extra long time to complete I/O, or whether there is a
separate bug in the LLD that causes I/O to never complete..

In any event, please try to reproduce with the above three patches in
place.

--nab
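P.S. For context on the second patch above: complete() wakes at most one
waiter on a struct completion, while complete_all() wakes all of them.
So if two contexts (say, an ABORT_TASK and a connection reset) can both
end up waiting on se_cmd->t_transport_stop_comp, a single complete() can
leave one of them stuck. A minimal sketch of the difference (illustrative
only, not the actual target code):

    #include <linux/completion.h>

    static DECLARE_COMPLETION(stop_comp);

    /* Waiter A: e.g. the ABORT_TASK path blocking in
     * transport_wait_for_tasks().
     */
    static void waiter_a(void)
    {
            wait_for_completion(&stop_comp);
    }

    /* Waiter B: e.g. a concurrent connection reset tearing down the
     * same command.
     */
    static void waiter_b(void)
    {
            wait_for_completion(&stop_comp);
    }

    /* Backend completion path: complete(&stop_comp) would wake only
     * one of A/B, leaving the other stuck; complete_all() wakes both.
     */
    static void io_done(void)
    {
            complete_all(&stop_comp);
    }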