Hi all,

Thank you for sharing all the interesting tips and ideas. I agree with Steve that there are two
different issues. It makes sense to reduce default_cmdsn_depth if the backend storage is overloaded
and cannot handle more outstanding I/O in a timely manner. However, this doesn't help in the case of
temporary backend outages, such as a RAID disk or Ceph node failure, where we know we will
definitely exceed the 5-second timeout and want the sessions to be reset. ESXi recovers quite well
from APD conditions, but this does not seem to be that situation.

Steve, I was testing the same Pacemaker+DRBD setup as you back in 2011 and decided to rewrite the
target resource agent from scratch; the original one was too unreliable and slow. (Sorry, I cannot
make it publicly available.) Note that I never saw ABORT_TASKs when running this setup on our Dell
hardware.

Martin

On 19.8.2015 at 10:22, Steve Beaudry wrote:
> Thanks Nicholas,
>
> I'll modify the resource agent script so that the default_cmdsn_depth can be set, and reduce
> the value to 16, based on your recommendation, and see what impact it has.
>
> I do still believe that we're talking about two different problems, one being performance and
> outstanding IOs timing out, while the other being a seeming incompatibility between LIO and ESX
> with regard to handling sessions when VMWare decides to restart a session, which it does for a
> number of reasons (really, in response to any number of SCSI errors that pop up).
>
> ...Steve...
>
>
> -------- Original message --------
> From: "Nicholas A. Bellinger"
> Date: 08-19-2015 12:04 AM (GMT-08:00)
> To: Steve Beaudry
> Cc: Alex Gorbachev, Martin Svec, target-devel@xxxxxxxxxxxxxxx
> Subject: Re: ESXi + LIO + Ceph RBD problem
>
> On Wed, 2015-08-19 at 06:12 +0000, Steve Beaudry wrote:
> > Thanks for the tips Nicholas,
> >
> > We've already been down the road of improving the performance based on what
> > you've mentioned, at least nearly everything...
> >
> > 1. The backend storage consists of arrays of dedicated disks, connected through
> > near-top-end RAID cards from LSI, battery-backed, with write-back caching, and
> > including 400GB SSD drives doing read acceleration for "hot" data. These
> > arrays are replicated using DRBD across two separate hosts. Read-balancing
> > is enabled in DRBD, so both hosts are used when reading data (typically reads
> > are striped across 8-10 disks)... Under "normal" circumstances, the
> > backend storage is very fast. Unfortunately, things happen... Seagate drives
> > are failing at a ridiculous rate (a separate issue with Seagate that they are
> > addressing). When a drive on either host fails, it can cause a timeout
> > significantly longer than "normal". We've also seen other reasons for
> > timeouts occasionally occurring, and the end result is that, because of a
> > sequence of events, a small timeout relating to a hardware RAID controller is
> > causing entire VMWare datacenters to hang, because the ESX server cannot restart
> > the connection to the LUNs, which is seemingly their method of dealing with
> > "hiccups".
> >
> > 2. The hardware queue depth of the LSI-9286CV-8eCC cards is 960 (with 256 per
> > array), so the default LIO queue depth of 64 shouldn't be killing it.
>
> Using scsi-mq with LSI LLD code would help here, as Scsi_Host->host_lock
> and request_queue->queue_lock contention with many LUNs per Scsi_Host
> also increases I/O latency.
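
For what it's worth, a quick way to check whether the SCSI midlayer on the target host is already
running in blk-mq mode is to read the scsi_mod module parameter (this assumes a kernel new enough
to carry scsi-mq, roughly 3.17 or later; the exact parameter spelling may differ on vendor kernels):

    # Prints Y or N depending on whether scsi-mq is in use
    cat /sys/module/scsi_mod/parameters/use_blk_mq

    # To switch it on at boot, add the following to the kernel command line
    # (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub) and reboot:
    #   scsi_mod.use_blk_mq=Y

I haven't measured how much this helps with the LSI cards, so treat it only as a starting point.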
>
> > Unfortunately, there is some multiplication of that number, as we are running
> > 4 IQNs per host, so LIO is likely generating 256 in-flight commands to the
> > backend storage, spread across 4 arrays... still well under the 960 queue
> > depth the card is supposed to be capable of handling. Because the IQNs/LUNs
> > are under the control of the Pacemaker cluster manager, setup of the IQNs/LUNs
> > happens immediately prior to the connections becoming active, so changing the
> > value of /sys/kernel/target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth
> > manually before the connection becomes active is not possible. To complicate
> > the situation somewhat, the OCF/Heartbeat/iSCSITarget "resource agent" (really
> > a standardized script that Pacemaker uses to control LIO IQNs) doesn't have
> > the capability built in to modify the "attribs" when starting a target. Yes,
> > I could customize/extend this resource agent in our environment, but to do so,
> > deviating from the standard source code, when the hardware queue is already
> > significantly higher than the limit set in LIO, seems unnecessary. We have
> > limited the queue depth on the ESX side, but I know we were doing that with an
> > eye to the 256/960 queue depth of the LSI controller, so it's quite possible
> > that it is set higher than the 64 default of LIO. I'll look into that.
> >
> The iscsi_target_mod default_cmdsn_depth=64 per IQN/TPGT endpoint
> assumes a small number of 10 Gb/sec hosts. Based on your hardware
> configuration, it sounds like it needs to be much smaller.
>
> If your H/A scripts are not currently capable of setting
> default_cmdsn_depth, then you'll need to either hard-code it to a
> smaller value in iscsi_target_mod, or set default_cmdsn_depth + NodeACL
> cmdsn_depth via configfs before LUN exports are made active with:
>
> echo 1 > ../target/iscsi/$WWN/$TPGT/enabled
>
> > 3. We disabled VAAI long ago, as it certainly did exacerbate the problem.
> >
> > 4. We are only using a single LUN per IQN. We are, however, using 4 IQNs per
> > server, and two IPs (different subnets) per IQN. We have ensured that VMWare
> > is not load balancing between the different paths, only using the non-active
> > path if the primary path happens to become unavailable... This was done this
> > way because we wanted to be able to migrate individual arrays between cluster
> > nodes without having to move them all at once. This precludes using a single
> > IQN with multiple associated LUNs.
> >
> >
> > I believe that, while tuning the system to its best should stop timeouts from
> > happening under theoretical, ideal conditions, under real-world conditions,
> > when drives fail or other "hiccups" happen, LIO is unable to allow ESX to
> > recover from such events without stopping and restarting, which is why I've
> > asked if it's possible to allow LIO to operate with "an exception to the
> > strict rules of the SCSI SPEC". It's not about handling it when things are
> > working at their optimum, it's about how the connections are handled when
> > something goes wrong.
> >
>
> I don't think allowing LIO to submit new backend I/Os when previous ones
> haven't yet completed is going to solve this issue.
>
> If a backend I/O request does not complete within a fixed amount of time
> defined by fabric host timeouts, sending yet more backend I/O will not
> address the underlying problem of why the backend is not able to
> complete outstanding I/Os in a timely fashion.
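
To make the configfs suggestion above concrete: a minimal sketch of the order of operations could
look like the following. The IQNs, TPG number and depth value are only placeholders, the paths
assume configfs is mounted at /sys/kernel/config, and the enable attribute may be spelled "enable"
rather than "enabled" depending on the kernel version.

    # Placeholders -- substitute the real target/initiator IQNs and TPG number.
    TARGET_IQN=iqn.2003-01.org.linux-iscsi.example:target0
    INITIATOR_IQN=iqn.1998-01.com.vmware:esx-host0
    TPG=/sys/kernel/config/target/iscsi/${TARGET_IQN}/tpgt_1

    # Lower the TPG-wide default before any initiator can log in.
    echo 16 > ${TPG}/attrib/default_cmdsn_depth

    # Optionally pin the depth per initiator on its NodeACL as well
    # (the ACL must already exist under acls/).
    echo 16 > ${TPG}/acls/${INITIATOR_IQN}/cmdsn_depth

    # Only then make the LUN exports active.
    echo 1 > ${TPG}/enable

The point is simply that both depths have to be written before the TPG is enabled, i.e. in the
resource agent's start action rather than by hand afterwards.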
>
> --nab