RE: ESXi + LIO + Ceph RBD problem

Hey Guys,

   We're seeing exactly the same behaviour with ESXi + LIO + DRBD, using 
pacemaker/corosync to control the cluster...

   Under periods of heavy load (typically during backups), we occasionally see 
warnings in the logs exactly as you've mentioned:

	> [ 3052.065353] ABORT_TASK: Found referenced iSCSI task_tag: 801219
	> [ 3052.066370] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801219
	> [ 3082.714529] ABORT_TASK: Found referenced iSCSI task_tag: 801223
	> [ 3082.714532] ABORT_TASK: ref_tag: 801223 already complete, skipping
	> [ 3082.714533] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 801223
	> [ 3082.714536] ABORT_TASK: Found referenced iSCSI task_tag: 801222
	> [ 3082.714540] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801222

   We set up monitoring scripts (sketched below) that watch for these sorts 
of entries, followed by the inevitable LUN rescan that ESXi performs when it 
can't talk to one of its disks:

	 [261204.802785] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000007
	 [261204.805443] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000008
	 [261204.806166] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000009
	 [261204.809172] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000000a
	etc... for the next 200 or so lines..
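
   In case it's useful, the watcher boils down to a loop like the following 
minimal Python sketch; the match patterns come straight from the log lines 
above, but the alert action is only a placeholder for our real paging hook:

	#!/usr/bin/env python
	# Minimal sketch of the kernel-log watcher described above: follow the
	# kernel log and flag ABORT_TASK / NON_EXISTENT_LUN storms. print() is
	# a placeholder for a real alerting hook.
	import re
	import subprocess

	PATTERN = re.compile(r"ABORT_TASK:|Detected NON_EXISTENT_LUN Access")

	# "journalctl -k -f -n 0" follows new kernel messages; "dmesg -w" works too
	proc = subprocess.Popen(["journalctl", "-k", "-f", "-n", "0"],
	                        stdout=subprocess.PIPE,
	                        universal_newlines=True)
	for line in proc.stdout:
	    if PATTERN.search(line):
	        print("iSCSI trouble: " + line.strip())  # placeholder alert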

  The only way we've found to deal with this is to migrate our primary storage 
to the second host in the cluster, unceremoniously killing the iSCSI stack on 
the initial host and starting it on the second (see the sketch below).  All 
this REALLY accomplishes is resetting the connections and letting ESXi 
reconnect.
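
  For completeness, the "kick it over" step is scripted on our side; under 
Pacemaker it amounts to little more than the following sketch (the resource 
and node names here are placeholders, not our actual configuration):

	#!/usr/bin/env python
	# Sketch of the failover kick: ask Pacemaker to move the iSCSI resource
	# group to the standby node, which tears down the iSCSI sessions on the
	# old host and lets ESXi reconnect. Names below are placeholders.
	import subprocess

	RESOURCE = "g_iscsi"    # placeholder: the iSCSI target resource group
	STANDBY = "storage02"   # placeholder: the standby cluster node

	# crm_resource --move pins the resource to the given node via a location
	# constraint; Pacemaker then migrates it there. Remember to remove the
	# constraint afterwards (crm_resource --clear).
	subprocess.check_call(["crm_resource", "--move",
	                       "--resource", RESOURCE, "--node", STANDBY])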

  We're fairly heavily invested in this setup, so my question is: is there a 
way to set a flag somewhere, or tweak a setting or the code, to allow LIO to 
violate the strict rules of the SCSI spec and let this setup work?  I'm 
going to HAVE to find a way around this very shortly, and I'd really rather 
the answer not be "replace the in-kernel iSCSI stack with TGT or SCST" just 
because they allow that sort of thing.

...Steve...

-----Original Message-----
From: target-devel-owner@xxxxxxxxxxxxxxx 
[mailto:target-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
Sent: August 16, 2015 7:01 PM
To: Martin Svec <martin.svec@xxxxxxxx>
Cc: target-devel@xxxxxxxxxxxxxxx
Subject: Re: ESXi + LIO + Ceph RBD problem

Hi Martin,

We tested and ran into similar scenarios.  Based on your description, you are 
running the RBD client on an OSD node, which is not recommended.
Nor is it recommended to run an OSD and a MON on the same node.

The ABORT_TASK errors have always been related to Ceph timeouts.
Basically, my understanding is that an RBD request times out, and after a 
brief grace period LIO and ESXi enter a hailstorm of retry/reset commands, at 
which point I/O pretty much won't resume.  LIO stays strict to the SCSI spec, 
under which the one data session must be maintained, whereas my understanding 
is that other solutions like TGT and SCST allow another session to start and 
bypass this issue.

We have good results with 3 OSD nodes, 3 MONs, and 2 LIO nodes with failover 
via Pacemaker, on kernel 4.1.

Best regards,
Alex

On Fri, Aug 14, 2015 at 11:28 AM, Martin Svec <martin.svec@xxxxxxxx> wrote:
> Hello,
>
> I'm testing the LIO iSCSI target on top of Ceph RBD as an iSCSI datastore
> for VMware vSphere. When one of the Ceph OSD nodes is terminated
> during heavy I/O (Storage vMotion to RBD), both the initiator and target
> sides report ABORT_TASK-related errors and all I/O stops. It's necessary
> to drop the iSCSI connections and let ESXi reconnect before I/O continues.
>
> ESXi warnings:
>
> WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba35:CH:0 T:11 L:1 : Task mgmt "Abort Task" with itt=0x944d1 (refITT=0x944cd) timed out.
> VMW_SATP_ALUA: satp_alua_issueCommandOnPath:651: Path "vmhba35:C0:T11:L1" (UP) command 0xa3 failed with status Timeout. H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
> WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60014057056b5748fdbb7c16c3a0bd46" state in doubt; requested fast path state update...
> WARNING: iscsi_vmk: iscsivmk_TaskMgmtAbortCommands: vmhba35:CH:0 T:11 L:1 : Abort task response indicates task with itt=0x944c7 has been completed on the target but the task response has not arrived
> ... and similar ones
>
> LIO warnings:
>
> [ 3052.065353] ABORT_TASK: Found referenced iSCSI task_tag: 801219
> [ 3052.066370] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801219
> [ 3082.714529] ABORT_TASK: Found referenced iSCSI task_tag: 801223
> [ 3082.714532] ABORT_TASK: ref_tag: 801223 already complete, skipping
> [ 3082.714533] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 801223
> [ 3082.714536] ABORT_TASK: Found referenced iSCSI task_tag: 801222
> [ 3082.714540] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801222
>
> I guess the errors are related to the hardcoded 5000 ms iSCSI timeout in
> ESXi, while the RBD driver needs a longer time to recover when one of the
> OSDs is lost. Is that possible? Does anybody have similar experience with
> ESXi + LIO iSCSI + Ceph? I tried to tweak a few Ceph heartbeat options
> (sketched below), but I'm still at the beginning of the learning curve...
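>
> For reference, the heartbeat knobs I have been experimenting with look like
> this in ceph.conf (the values are just my current guesses, not tested
> recommendations):
>
> 	[osd]
> 	# how often OSDs ping their peers (default 6 s)
> 	osd heartbeat interval = 3
> 	# how long before a silent peer is reported to the MONs (default 20 s)
> 	osd heartbeat grace = 10
>
> 	[mon]
> 	# how many reports the MONs need before marking an OSD down (default 3)
> 	mon osd min down reports = 2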
>
> My Ceph setup is very basic now: 3 virtual machines with Debian Jessie and
> Ceph 0.80.7, with one OSD and one MON on each VM. The iSCSI LUN is published
> from one of the nodes via a dedicated network adapter to the underlying
> vSphere infrastructure.
>
> Thank you.
>
> Martin
>
