On Wed, 2014-05-21 at 13:01 +0300, Charalampos Pournaris wrote:
> Hi Thomas,
>
> On Tue, May 20, 2014 at 10:44 PM, Thomas Glanzmann <thomas@xxxxxxxxxxxx> wrote:
> > Hello Harry,
> >
> >> http://pastebin.com/AqqJaYVX
> >
> > I checked my log from 11th October last year when this happened to me,
> > and it looks like the same error we're hitting:
> >
> > ...
> > Oct 11 11:53:56 node-62 kernel: [219465.151250] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5488
> > Oct 11 11:53:56 node-62 kernel: [219465.151261] ABORT_TASK: Found referenced iSCSI task_tag: 5494
> > Oct 11 11:53:56 node-62 kernel: [219465.151264] ABORT_TASK: ref_tag: 5494 already complete, skipping
> > Oct 11 11:53:56 node-62 kernel: [219465.151267] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5494
> > Oct 11 11:53:56 node-62 kernel: [219465.151271] ABORT_TASK: Found referenced iSCSI task_tag: 5495
> > Oct 11 11:53:56 node-62 kernel: [219465.151273] ABORT_TASK: ref_tag: 5495 already complete, skipping
> > Oct 11 11:53:56 node-62 kernel: [219465.151275] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5495
> > Oct 11 11:54:09 node-62 kernel: [219478.744212] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000008
> > Oct 11 11:54:09 node-62 kernel: [219478.751738] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5508
> > Oct 11 11:54:23 node-62 kernel: [219492.351282] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000013
> > Oct 11 11:54:23 node-62 kernel: [219492.358819] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5514
> > Oct 11 11:54:23 node-62 kernel: [219492.630489] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000001d
> > ...
> >
> > Full log here: https://thomas.glanzmann.de/crash/extracted_crash_dmesg.log
> >
> > I'll try to reproduce it and let the list know when I find something.
> > Harry, can you let me know how many VMs you had running, whether they
> > were thin or thick provisioned, and whether they were idle or under a
> > heavy workload?
> >
> > In my case I did _not_ use jumbo frames, but 802.3ad bonding. My
> > switches are configured to be able to deliver jumbo frames.
> >
> > Cheers,
> > Thomas
>
> Indeed, it seems that we hit the same issue, as the log lines look
> pretty similar.

One other thing that has historically caused ESX issues is having too
many LUNs on an individual iSCSI-Target TargetName+TargetPortalGroupTag
endpoint. This typically manifests as false-positive I/O timeouts (on
the ESX host side) when multiple LUNs are exported on a single endpoint
over 1 Gb/sec Ethernet (i.e., all accessed over a single session).
Usually, spreading the LUNs across multiple endpoints (i.e., multiple
IQNs) reduces or completely eliminates the ESX host-side issues.
Typically you want to keep it to around 4-5 LUNs per endpoint on
1 Gb/sec ports.

> Our setup was comprised of around 7-10 VMs powered on with some
> activity (not too intense), and if I recall correctly some VM
> deployments (through OVF/OVA) had been made prior to the failure.
> Obviously, there are no discrete steps to reproduce the problem... I
> hope this can help reproduce it in your environment, as it's difficult
> for me to make changes in our production one (e.g. to recompile the
> kernel with debugging enabled).

FYI, as long as CONFIG_DYNAMIC_DEBUG is compiled in, the debug output
for the target core + associated fabric drivers can be turned on/off on
the fly via debugfs.
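
For example, a minimal sketch, assuming the target core and iSCSI
fabric are built as modules under their mainline names, and debugfs is
mounted at the usual /sys/kernel/debug:

  # Enable all pr_debug() output in target core and the iSCSI fabric
  echo 'module target_core_mod +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'module iscsi_target_mod +p' > /sys/kernel/debug/dynamic_debug/control

  # ...and switch it back off once the problem has been captured
  echo 'module target_core_mod -p' > /sys/kernel/debug/dynamic_debug/control
  echo 'module iscsi_target_mod -p' > /sys/kernel/debug/dynamic_debug/control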
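
And to illustrate the multiple-endpoint layout mentioned above, here's
a rough targetcli sketch (the IQNs and backstore names are made up for
the example, and assume the block backstores already exist):

  # Create two separate TargetName endpoints instead of a single one
  targetcli /iscsi create iqn.2003-01.org.linux-iscsi.node-62:ep1
  targetcli /iscsi create iqn.2003-01.org.linux-iscsi.node-62:ep2

  # Spread the LUNs out, ~4-5 per endpoint, instead of stacking them
  # all under one IQN
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep1/tpg1/luns create /backstores/block/vol1
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep1/tpg1/luns create /backstores/block/vol2
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep2/tpg1/luns create /backstores/block/vol3
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep2/tpg1/luns create /backstores/block/vol4

ESX will then open a session per endpoint, so I/O to the LUNs no longer
all queues behind a single session.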
--nab