RE: ESXi + LIO + Ceph RBD problem

Follow up to this... Thinking about queue depths reminded me of something
relevant to our situation, though likely unrelated to LIO specifically.
Still, I wanted to mention it for completeness, since I noted earlier that
we'd limited the queue depth in VMware ESX...

We're normally only seeing problems while our backups are running. Our
backups are not limited by the VMware ESX queue depth, because we use a
backup product that connects to the iSCSI target directly from a different
host (coordinating with VMware and taking snapshots to avoid
concurrent-access problems).

While we've limited the queue depth of the ESX servers, we've never been able
to find a way to limit the queue depth of the backup software.  Taking a
snapshot deals with the "concurrent access to a file" problem, but it does
nothing about the fact that our backup server can swamp the LUN with
requests.

VMware's documentation notes that higher-than-normal latency is common during
this type of activity, and triggering timeouts is quite possibly normal.
Unfortunately, again, VMware seems to treat resetting the LUN connection as a
perfectly normal way of dealing with a "hiccup", which is where my original
request came in...

-----Original Message-----
From: target-devel-owner@xxxxxxxxxxxxxxx 
[mailto:target-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Steve Beaudry
Sent: August 18, 2015 11:13 PM
To: Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx>
Cc: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Martin Svec 
<martin.svec@xxxxxxxx>; target-devel@xxxxxxxxxxxxxxx
Subject: RE: ESXi + LIO + Ceph RBD problem

Thanks for the tips, Nicholas,

  We've already been down the road of improving performance along the lines
you've mentioned; we've covered nearly everything...

1. The backend storage consists of arrays of dedicated disks behind roughly
top-of-the-line LSI RAID cards, battery-backed, with write-back caching, and
including 400GB SSD drives doing read acceleration for "hot" data.  These
arrays are replicated using DRBD across two separate hosts.  Read-balancing
is enabled in DRBD, so both hosts are used when reading data (typically reads
are striped across 8-10 disks).  Under "normal" circumstances, the backend
storage is very fast.  Unfortunately, things happen... Seagate drives are
failing at a ridiculous rate (a separate issue that Seagate is addressing).
When a drive on either host fails, it can cause a timeout significantly
longer than "normal".  We've also seen other occasional causes of timeouts,
and the end result is that, through a sequence of events, a small timeout at
a hardware RAID controller hangs entire VMware datacenters, because the ESX
server cannot restart the connection to the LUNs, which is seemingly its
method of dealing with "hiccups".

2. The hardware queue depth of the LSI 9286CV-8eCC cards is 960 (256 per
array), so LIO's default queue depth of 64 shouldn't be killing it.
There is some multiplication of that number, though, as we are running 4 IQNs
per host, so LIO is likely generating up to 256 in-flight commands to the
backend storage, spread across 4 arrays... still well under the 960 queue
depth the card is supposed to be capable of handling.  Because the IQNs/LUNs
are under the control of the Pacemaker cluster manager, setup of the IQNs/LUNs
happens immediately prior to the connections becoming active, so manually
changing /sys/kernel/target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth
before the connection becomes active is not possible.  To complicate the
situation somewhat, the ocf:heartbeat:iSCSITarget "resource agent" (really a
standardized script that Pacemaker uses to control LIO IQNs) has no built-in
capability to modify the "attribs" when starting a target (a sketch of what
such an extension could look like follows below).  Yes, I could
customize/extend this resource agent in our environment, but deviating from
the standard source code, when the hardware queue is already significantly
deeper than the limit set in LIO, seems unnecessary.  We have limited the
queue depth on the ESX side, but I know we set that with an eye to the
256/960 queue depth of the LSI controller, so it's quite possible that it is
set higher than the 64 default of LIO.  I'll look into that.
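
If we did extend the agent, the idea is only a few lines; a minimal sketch,
assuming the standard configfs mount point, the agent's usual OCF_RESKEY_iqn
parameter, and an illustrative depth of 16:

	# hypothetical addition to the iSCSITarget resource agent's start():
	# clamp the CmdSN window on every TPG of the newly created target,
	# before any initiator logs in
	for tpgt in /sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_*; do
	    [ -d "${tpgt}" ] || continue
	    echo 16 > "${tpgt}/attrib/default_cmdsn_depth"
	done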

3. We disabled VAAI long ago, as it certainly exacerbated the problem.

4. We are only using a single LUN per IQN.  We are, however, using 4 IQNs per
server, and two IPs (on different subnets) per IQN.  We have ensured that
VMware is not load balancing between the different paths; it only uses the
non-active path if the primary path happens to become unavailable (see the
sketch below).  We designed it this way because we wanted to be able to
migrate individual arrays between cluster nodes without having to move them
all at once, which precludes using a single IQN with multiple associated
LUNs.
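
(For completeness, the ESX-side knob for this is the fixed path selection
policy; a sketch from memory, where the device identifier is a placeholder
and the exact esxcli syntax should be verified against your ESXi version:)

	# pin a device to the fixed PSP, so only the preferred path is
	# used while it remains available
	esxcli storage nmp device set --device naa.XXXXXXXX --psp VMW_PSP_FIXED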


I believe that while tuning the system as well as possible should stop
timeouts from happening under theoretical, ideal conditions, under real-world
conditions, when drives fail or other "hiccups" happen, LIO does not allow
ESX to recover from such events without stopping and restarting, which is why
I've asked whether it's possible to allow LIO to operate with "an exception
to the strict rules of the SCSI SPEC".  It's not about handling things when
they are working at their optimum; it's about how the connections are handled
when something goes wrong.

As a small aside, you're quite correct that if there is no timeout on the
storage backend, there is no problem.  We have the same setup running on 4
identical storage hosts, with the only difference being that the spinning
disks in those hosts are replaced with 400GB Intel DC S3700 SSDs, and we've
never, ever, encountered a timeout on those arrays (go figure).  I suspect,
though, that if we lost a drive, and things slowed down because the RAID card
suddenly needed to read every other drive and reconstruct the missing data,
even the SSD-based hosts might suffer the same "need to restart the target"
symptom.

Cheers, and thank you again for the work you've done on this, and for the
pointers and help thus far...

...Steve...



-----Original Message-----
From: Nicholas A. Bellinger [mailto:nab@xxxxxxxxxxxxxxx]
Sent: August 18, 2015 9:25 PM
To: Steve Beaudry <Steve.Beaudry@xxxxxxxxxxxxx>
Cc: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Martin Svec
<martin.svec@xxxxxxxx>; target-devel@xxxxxxxxxxxxxxx
Subject: Re: ESXi + LIO + Ceph RBD problem

On Mon, 2015-08-17 at 22:10 +0000, Steve Beaudry wrote:
> Hey Guys,
>
>    We're seeing exactly the same behaviour using ESXi + LIO + DRBD,
> using pacemaker/corosync to control the cluster...
>
>    Under periods of heavy load (typically during backups), we
> occasionally see warnings in the logs exactly as you've mentioned:
>
> 	> [ 3052.065353] ABORT_TASK: Found referenced iSCSI task_tag: 801219
> 	> [ 3052.066370] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 	> [ 3082.714529] ABORT_TASK: Found referenced iSCSI task_tag:
> 	> [ 3082.714532] ABORT_TASK: ref_tag: 801223 already complete, skipping
> 	> [ 3082.714533] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 801223
> 	> [ 3082.714536] ABORT_TASK: Found referenced iSCSI task_tag: 801222
> 	> [ 3082.714540] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801222
>
>    We set up monitoring scripts that watch for these sorts of entries,
> followed by the inevitable LUN RESCAN that ESXi performs when it
> can't talk to one of its disks:
>
> 	 [261204.802785] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000007
> 	 [261204.805443] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000008
> 	 [261204.806166] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000009
> 	 [261204.809172] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000000a
> 	etc... for the next 200 or so lines..
>
>   The only way we've found to deal with this is to migrate our primary
> storage to the second host in the cluster, unceremoniously killing the
> iSCSI stack on the initial host, and starting it on the second host.
> All this is REALLY accomplishing is resetting the connections, and letting
> ESXi reconnect.
>
>   We're fairly heavily invested in this setup, and my question is: is
> there a way to set a flag somewhere, or tweak a section of code, to
> allow LIO to violate the strict rules of the SCSI SPEC, so that this
> setup can work?  I'm going to HAVE to find a way around this very
> shortly, and I'd really rather the option not be "replace the in-kernel
> iSCSI stack with TGT or SCST", because they allow that sort of thing.
>

To clarify: following the iscsi spec in the previous ceph discussion meant
the iscsi-target code waits for outstanding in-flight I/Os to complete before
allowing a new session reinstatement to retry the same I/Os again.

Having LIO 'violate the strict rules' in this context is not going to prevent
an ESX host from hitting its internal SCSI timeout when a backend I/O takes
longer than 5 seconds to respond under load.

So, if a backend storage configuration can't keep up with forward facing I/O
requirements, you need to:

* Reduce ../target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth
  and/or initiator NodeACL cmdsn_depth

  This attribute controls how many I/Os a single iscsi initiator can
  keep in flight at a given time.

  The current default is 64; try 16 and work backwards in powers of 2
  from there.  Note that any active sessions need to be restarted for
  changes to this value to take effect.
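
  A minimal sketch of both knobs, assuming the standard configfs mount
  point and a NodeACL per initiator (the endpoint names and the value 16
  are illustrative):

	# clamp the per-session CmdSN window for a whole TPG...
	echo 16 > /sys/kernel/config/target/iscsi/$IQN/tpgt_1/attrib/default_cmdsn_depth
	# ...and/or for a single initiator, via its NodeACL
	echo 16 > /sys/kernel/config/target/iscsi/$IQN/tpgt_1/acls/$INITIATOR_IQN/cmdsn_depth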

* Disable VAAI clone and/or zero primitive emulation

  Clone operations doing local EXTENDED_COPY emulation can generate
  significant back-end I/O load, and disabling it for some or all
  backends might help mask the unbounded I/O latency to ESX.

  ../target/core/$HBA/$DEV/attrib/emulate_3pc controls reporting the
  EXTENDED_COPY feature bit to ESX.  You can verify with esxtop that it
  has been disabled.
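
  For example (a sketch; the ESX-side setting names in the comment are the
  usual VAAI advanced options, quoted from memory):

	# stop reporting EXTENDED_COPY support for one backend device
	echo 0 > /sys/kernel/config/target/core/$HBA/$DEV/attrib/emulate_3pc
	# the corresponding ESX host advanced settings are
	# DataMover.HardwareAcceleratedMove and DataMover.HardwareAcceleratedInit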

* Use a single LUN mapping per IQN/TPGT endpoint

  The ESX iSCSI initiator has well known fairness issues that can
  generate false positive internal SCSI timeouts under load, when
  multiple LUN mappings exist on the same target IQN/TPGT endpoint.

  To avoid hitting these types of ESX false positives, using a single
  LUN (LUN=0) mapping per target IQN/TPGT endpoint might also help.
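
  In configfs terms, that layout is roughly the following (endpoint names
  are illustrative; the name of the symlink itself is arbitrary):

	# one backstore device, exported as LUN 0 on its own IQN/TPGT endpoint
	mkdir -p /sys/kernel/config/target/iscsi/$IQN/tpgt_1/lun/lun_0
	ln -s /sys/kernel/config/target/core/$HBA/$DEV \
	      /sys/kernel/config/target/iscsi/$IQN/tpgt_1/lun/lun_0/port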

But keep in mind these tunables will only mask the underlying issue of
unbounded backend I/O latency.  If backend I/O latency runs to minutes,
rather than just a few seconds over the expected 5 second internal timeout,
changing the iscsi-target code is still not going to make ESX happy.

--nab
