Re: ESXi + LIO + Ceph RBD problem

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Tue, 18 Aug 2015 21:24:43 -0700

On Mon, 2015-08-17 at 22:10 +0000, Steve Beaudry wrote:
> Hey Guys,
> 
>    We're seeing exactly the same behaviour using ESXi + LIO + DRBD, using 
> pacemaker/corosync to control the cluster...
> 
>    Under periods of heavy load (typically during backups), we occasionally see 
> warnings in the logs exactly as you've mentioned:
> 
> 	> [ 3052.065353] ABORT_TASK: Found referenced iSCSI task_tag: 801219 [
> 	> [ 3052.066370] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 	> [ 3082.714529] ABORT_TASK: Found referenced iSCSI task_tag:
> 	> [ 3082.714532] ABORT_TASK: ref_tag: 801223 already complete,
> 	> skipping [ 3082.714533] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST
> 	> for ref_tag: 801223 [ 3082.714536] ABORT_TASK: Found referenced iSCSI
> 	> task_tag: 801222 [ 3082.714540] ABORT_TASK: Sending
> 	> TMR_FUNCTION_COMPLETE for ref_tag: 801222
> 
>    We setup monitoring scripts that watch for these sorts of entries, followed 
> by the inevitable LUN RESCAN that ESXi will perform when it can't talk to one 
> of it's disks :
> 
> 	 [261204.802785] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 
> 0x00000007
> 	 [261204.805443] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 
> 0x00000008
> 	 [261204.806166] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 
> 0x00000009
> 	 [261204.809172] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 
> 0x0000000a
> 	etc... for the next 200 or so lines..
> 
>   The only way we've found to deal with this is to migrate our primary storage 
> to the second host in the cluster, unceremoniously killing the iSCSI stack on 
> the initial host, and starting it on the second host.  All this is REALLY 
> accomplishing is resetting the connections, and letting ESXi reconnect.
> 
>   We're fairly heavily invested in this setup, and my question is, is there a 
> way to set a flag somewhere, or tweak a setting of code, to allow LIO to 
> violate the strict rules of the SCSI SPEC, to allow this setup to work?  I'm 
> going to HAVE to find a way around this very shortly, and I'd really rather 
> the option not be "replace the in kernel iSCSI stack with TGT or SCST", 
> because they allow that sort of thing.
> 

To clarify, following the iSCSI spec in the previous ceph discussion meant
that iscsi-target code waits for outstanding in-flight I/Os to complete
before allowing a new session reinstatement to retry the same I/Os
again.

Having LIO 'violate the strict rules' in this context is not going to
prevent an ESX host from hitting its internal SCSI timeout when a backend
I/O takes longer than 5 seconds to respond under load.

So, if a backend storage configuration can't keep up with forward facing
I/O requirements, you need to:

* Reduce ../target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth
  and/or initiator NodeACL cmdsn_depth

  This attribute controls how many I/Os a single iscsi initiator can 
  keep in flight at a given time.

  Current default is 64, try 16 and work backwards in powers of 2 from
  there. Note that any active sessions need to be restarted for changes
  to this value to take effect.

* Disable VAAI clone and/or zero primitive emulation

  Clone operations doing local EXTENDED_COPY emulation can generate 
  significant back-end I/O load, and disabling it for some or all 
  backends might help mask the unbounded I/O latency to ESX.

  ../target/core/$HBA/$DEV/attrib/emulate_3pc controls reporting the 
  EXTENDED_COPY feature bit to ESX.  You can verify with esxtop that it 
  has been disabled.

* Use a single LUN mapping per IQN/TPGT endpoint

  The ESX iSCSI initiator has well known fairness issues that can 
  generate false positive internal SCSI timeouts under load, when
  multiple LUN mappings exist on the same target IQN/TPGT endpoint.

  To avoid hitting these types of ESX false positives, using a single 
  LUN (LUN=0) mapping per target IQN/TPGT endpoint might also help.
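
As a rough sketch of the first two items, the attributes above are plain
configfs files, so they can be changed by writing a new value to them.
The helper names below (candidate_depths, set_attrib) are made up for
illustration and are not part of LIO or targetcli; the paths in the
usage comments are the ones given above and need the real HBA, device,
IQN and TPGT names filled in for your setup.

```python
from pathlib import Path


def candidate_depths(start=16, floor=1):
    """Return start, start/2, start/4, ... down to floor (halving each step)."""
    if floor < 1 or start < floor:
        raise ValueError("need start >= floor >= 1")
    depths = []
    depth = start
    while depth >= floor:
        depths.append(depth)
        depth //= 2
    return depths


def set_attrib(attrib_file, value):
    """Write value to an attribute file and return the previous contents."""
    path = Path(attrib_file)
    previous = path.read_text().strip()
    path.write_text(f"{value}\n")
    return previous


# Hypothetical usage on the target host (run as root, fill in real paths):
#   set_attrib("../target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth", 16)
#   set_attrib("../target/core/$HBA/$DEV/attrib/emulate_3pc", 0)
# Depths to try in order, starting from 16:
#   candidate_depths(16)
```

set_attrib returns the old value so it can be restored if a change does
not help.  As noted above, active sessions still have to be restarted
before a new cmdsn_depth takes effect.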

But keep in mind these tunables will only mask the issue of unbounded
backend I/O latency.  If backend I/O latency is on the order of minutes,
rather than just a few seconds over the expected 5 second internal
timeout, changing iscsi-target code is still not going to make ESX
happy.

--nab
