RE: ESXi + LIO + Ceph RBD problem

Thanks for the tips, Nicholas,

  We've already been down the road of improving performance along the lines
you've mentioned; we've covered nearly everything:

1. The backend storage is built from arrays of dedicated disks, connected
through roughly top-of-the-line LSI RAID cards with battery-backed write-back
caching, plus 400GB SSDs providing read acceleration for "hot" data.  These
arrays are replicated using DRBD across two separate hosts.  Read-balancing is
enabled in DRBD, so both hosts are used when reading data (typically reads are
striped across 8-10 disks); the relevant piece of that config is sketched
below.  Under "normal" circumstances the backend storage is very fast.
Unfortunately, things happen: Seagate drives are failing at a ridiculous rate
(a separate issue that Seagate is addressing), and when a drive on either host
fails, it can cause a timeout significantly longer than "normal".  We've also
seen other occasional causes of timeouts.  The end result is that, through a
sequence of events, a small timeout at a hardware RAID controller hangs entire
VMware datacenters, because the ESX server cannot restart its connection to
the LUNs, which is seemingly its method of dealing with "hiccups".

2. The hardware queue depth of the LSI 9286CV-8eCC cards is 960 (256 per
array), so LIO's default queue depth of 64 shouldn't be overwhelming it.
There is some multiplication of that number, since we run 4 IQNs per host, so
LIO is likely putting 256 commands in flight to the backend storage, spread
across 4 arrays... still well under the 960 the card is supposed to be
capable of handling.  Because the IQNs/LUNs are under the control of the
Pacemaker cluster manager, setup of the IQNs/LUNs happens immediately before
the connections become active, so manually changing
/sys/kernel/config/target/iscsi/$IQN/$TPGT/attrib/default_cmdsn_depth before
the connection becomes active is not practical.  To complicate the situation,
the ocf:heartbeat:iSCSITarget "resource agent" (really a standardized script
that Pacemaker uses to control LIO IQNs) has no built-in way to modify the
"attribs" when starting a target.  Yes, I could customize/extend this
resource agent in our environment, but deviating from the standard source
code, when the hardware queue is already significantly deeper than the limit
set in LIO, seems unnecessary (if we do go that route, a post-start hook
along the lines of the sketch below would do it).  We have limited the queue
depth on the ESX side, but I know we set that with an eye to the 256/960
queue depth of the LSI controller, so it's quite possible it is higher than
LIO's default of 64.  I'll look into that.
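
(If we do end up lowering it without touching the resource agent, the
attribute is writable once the target exists, so a small post-start hook
could set it before the initiators log in.  A minimal sketch, assuming the
standard configfs mount at /sys/kernel/config; the IQN and depth are
placeholders, and it only affects sessions established afterwards.)

    #!/bin/bash
    # Hypothetical post-start hook: lower default_cmdsn_depth on every TPG
    # of one IQN, after Pacemaker has started the iSCSITarget resource.
    IQN="iqn.2015-08.example.com:array1"   # placeholder
    DEPTH=16                               # starting point suggested below
    for attrib in /sys/kernel/config/target/iscsi/"${IQN}"/tpgt_*/attrib/default_cmdsn_depth
    do
        echo "${DEPTH}" > "${attrib}"
    done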

3. We disabled VAAI long ago, as it certainly did exacerbate the problem.

4. We are only using a single LUN per IQN.  We are, however, using 4 IQNs per
server, and two IPs (on different subnets) per IQN.  We have ensured that
VMware is not load-balancing between the different paths; it only uses the
non-active path if the primary path becomes unavailable.  We did it this way
because we wanted to be able to migrate individual arrays between cluster
nodes without having to move them all at once, which precludes using a single
IQN with multiple associated LUNs.  (The per-array Pacemaker layout is
sketched below.)
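
(For anyone following along, each array's Pacemaker layout looks roughly like
the crm shell sketch below.  The resource names, IQN, portal IPs, device
path, and the "implementation" value are placeholders rather than our exact
configuration.)

    primitive p_target_array1 ocf:heartbeat:iSCSITarget \
        params implementation="lio" iqn="iqn.2015-08.example.com:array1" \
               portals="10.0.1.10:3260 10.0.2.10:3260" \
        op monitor interval="15s"
    primitive p_lun_array1 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="lio" target_iqn="iqn.2015-08.example.com:array1" \
               lun="0" path="/dev/drbd1" \
        op monitor interval="15s"
    group g_array1 p_target_array1 p_lun_array1
    # ...and likewise for array2..array4, so each IQN/LUN pair (and its
    # DRBD backing device) can be migrated between nodes independently.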


I believe that while tuning the system as well as possible should stop
timeouts from happening under theoretical, ideal conditions, under real world
conditions, when drives fail or other "hiccups" happen, LIO does not allow
ESX to recover from such events without the target being stopped and
restarted, which is why I've asked whether it's possible to let LIO operate
with "an exception to the strict rules of the SCSI SPEC".  It's not about how
things are handled when everything is working at its optimum; it's about how
the connections are handled when something goes wrong.

As a small aside, you're quite correct that if there is no timeout on the
storage backend, there is no problem.  We have the same setup running on 4
identical storage hosts, with the only difference being that the spinning
disks in those hosts are replaced with 400GB Intel DC S3700 SSDs, and we've
never, ever encountered a timeout on those arrays (go figure).  I suspect,
though, that if we were to lose a drive there, and things slowed down because
the RAID card suddenly had to read every other drive and reconstruct the
missing data, even the SSD-based hosts might show the same "need to restart
the target" symptom.

Cheers, and thank you again for the work you've done on this, and for the 
pointers and help thus far...

...Steve...



-----Original Message-----
From: Nicholas A. Bellinger [mailto:nab@xxxxxxxxxxxxxxx]
Sent: August 18, 2015 9:25 PM
To: Steve Beaudry <Steve.Beaudry@xxxxxxxxxxxxx>
Cc: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Martin Svec 
<martin.svec@xxxxxxxx>; target-devel@xxxxxxxxxxxxxxx
Subject: Re: ESXi + LIO + Ceph RBD problem

On Mon, 2015-08-17 at 22:10 +0000, Steve Beaudry wrote:
> Hey Guys,
>
>    We're seeing exactly the same behaviour using ESXi + LIO + DRBD,
> using pacemaker/corosync to control the cluster...
>
>    Under periods of heavy load (typically during backups), we
> occasionally see warnings in the logs exactly as you've mentioned:
>
> 	> [ 3052.065353] ABORT_TASK: Found referenced iSCSI task_tag: 801219
> 	> [ 3052.066370] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 	> [ 3082.714529] ABORT_TASK: Found referenced iSCSI task_tag:
> 	> [ 3082.714532] ABORT_TASK: ref_tag: 801223 already complete, skipping
> 	> [ 3082.714533] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 801223
> 	> [ 3082.714536] ABORT_TASK: Found referenced iSCSI task_tag: 801222
> 	> [ 3082.714540] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801222
>
>    We set up monitoring scripts that watch for these sorts of entries,
> followed by the inevitable LUN rescan that ESXi will perform when it
> can't talk to one of its disks:
>
> 	 [261204.802785] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000007
> 	 [261204.805443] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000008
> 	 [261204.806166] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000009
> 	 [261204.809172] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000000a
> 	etc... for the next 200 or so lines..
>
>   The only way we've found to deal with this is to migrate our primary
> storage to the second host in the cluster, unceremoniously killing the
> iSCSI stack on the initial host, and starting it on the second host.
> All this is REALLY accomplishing is resetting the connections, and letting 
> ESXi reconnect.
>
>   We're fairly heavily invested in this setup, and my question is: is
> there a way to set a flag somewhere, or tweak a setting or bit of code,
> to allow LIO to violate the strict rules of the SCSI SPEC, so that this
> setup can work?  I'm going to HAVE to find a way around this very
> shortly, and I'd really rather the option not be "replace the in-kernel
> iSCSI stack with TGT or SCST", because they allow that sort of thing.
>

To clarify: following the iSCSI spec in the previous Ceph discussion meant
that the iscsi-target code waits for outstanding in-flight I/Os to complete
before allowing a new session reinstatement to retry the same I/Os again.

Having LIO 'violate the strict rules' in this context is not going to prevent
an ESX host from hitting its internal SCSI timeout when a backend I/O takes
longer than 5 seconds to complete under load.

So, if a backend storage configuration can't keep up with the front-end I/O
requirements, you need to:

* Reduce ../target/iscsi/$IQN/$TPGT/attrib/default_cmdsn_depth
  and/or the initiator NodeACL cmdsn_depth

  This attribute controls how many I/Os a single iSCSI initiator can
  keep in flight at a given time.

  The current default is 64; try 16 and work backwards in powers of 2
  from there.  Note that any active session needs to be re-started for
  changes to this value to take effect.  (A configfs sketch covering
  this and the emulate_3pc attribute below follows after this list.)

* Disable VAAI clone and/or zero primitive emulation

  Clone operations doing local EXTENDED_COPY emulation can generate
  significant back-end I/O load, and disabling it for some or all
  backends might help mask the unbounded I/O latency to ESX.

  ../target/core/$HBA/$DEV/attrib/emulate_3pc controls reporting the
  EXTENDED_COPY feature bit to ESX.  You can verify with esxtop that it
  has been disabled.

* Use a single LUN mapping per IQN/TPGT endpoint

  The ESX iSCSI initiator has well known fairness issues that can
  generate false positive internal SCSI timeouts under load, when
  multiple LUN mappings exist on the same target IQN/TPGT endpoint.

  To avoid hitting these types of ESX false positives, using a single
  LUN (LUN=0) mapping per target IQN/TPGT endpoint might also help.
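
For example, assuming the standard configfs mount at /sys/kernel/config (the
IQN, TPGT number, HBA and device names below are placeholders):

  # Lower the per-session command window; takes effect at the next login.
  echo 16 > /sys/kernel/config/target/iscsi/iqn.2015-08.example.com:array1/tpgt_1/attrib/default_cmdsn_depth

  # Stop advertising EXTENDED_COPY (VAAI clone offload) for one backend device.
  echo 0 > /sys/kernel/config/target/core/iblock_0/array1/attrib/emulate_3pc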

But keep in mind these tunables will only mask the issue of unbounded backend
I/O latency.  If backend I/O latency is on the order of minutes, and not just
a few seconds over the expected 5 second internal timeout, changing the
iscsi-target code is still not going to make ESX happy.

--nab
