Re: ESXi + LIO + Ceph RBD problem

Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> · Wed, 19 Aug 2015 12:16:21 -0400

I have to say that changing default_cmdsn_depth did not help us with
the abnormal timeouts, i.e. OSD failing or some other abrupt event.
When that happens we detect the event via ABORT_TASK and if the event
is transient usually nothing happens.  Anything more than a few
seconds will usually result in Ceph recovery but ESXi gets stuck and
never comes out of APD.  Looks like it tries to establish another
session by bombarding the target with retries and resets, and
ultimately gives up and goes to PDL state.  Then the only option is a
reboot.

So to be clear, we have moved on from a discussion about slow storage
to a discussion about what happens during unexpected and abnormal
timeouts.  Anecdotal evidence suggests that SCST based systems will
allow ESXi to recover from this condition, while ESXi does not play as
well with LIO based systems in those situations.

What is the difference, and is there willingness to allow LIO to be
modified to work with this ESXi behavior?  Or should we ask VMware to
do something for ESXi to play better with LIO?  I cannot fix the code,
but would be happy to be the voice of the issue via any available
channels.

Best regards,
Alex

On Wed, Aug 19, 2015 at 4:22 AM, Steve Beaudry
<Steve.Beaudry@xxxxxxxxxxxxx> wrote:
> Thanks Nicholas,
>
>    I'll modify the resource agent script so that the default_cmdsn_depth can
> be set, and reduce the value to 16, based on your recommendation, and see
> what impact it has.
>
>    I do still believe that we're talking about two different problems: one
> is performance and outstanding IOs timing out, while the other is a
> seeming incompatibility between LIO and ESX with regard to handling
> sessions when VMware decides to restart a session, which it does for a
> number of reasons (really, in response to any number of SCSI errors that
> pop up).
>
> ...Steve...
>
>
> -------- Original message --------
> From: "Nicholas A. Bellinger"
> Date: 08-19-2015 12:04 AM (GMT-08:00)
> To: Steve Beaudry
> Cc: Alex Gorbachev, Martin Svec, target-devel@xxxxxxxxxxxxxxx
> Subject: Re: ESXi + LIO + Ceph RBD problem
>
> On Wed, 2015-08-19 at 06:12 +0000, Steve Beaudry wrote:
>> Thanks for the tips Nicholas,
>>
>>   We've already been down the road of improving the performance based on
>> what you've mentioned, at least nearly everything...
>>
>> 1. The backend storage consists of arrays of dedicated disks, connected
>> through about the top end RAID cards from LSI, battery backed, with
>> write-back caching, and including 400GB SSD drives doing read
>> acceleration for "hot" data.  These arrays are replicated using DRBD
>> across two separate hosts.  Read-balancing is enabled in DRBD, so both
>> hosts are used when reading data (typically reads are being striped
>> across 8-10 disks)...  Under "normal" circumstances, the backend storage
>> is very fast.  Unfortunately, things happen...  Seagate drives are
>> failing at a ridiculous rate (a separate issue with Seagate that they
>> are addressing).  When a drive on either host fails, it can cause a
>> timeout significantly longer than "normal".  We've also seen other
>> reasons for timeouts occasionally occurring, and the end result is that,
>> because of a sequence of events, a small timeout relating to a hardware
>> RAID controller is causing entire VMware datacenters to hang, because
>> the ESX server cannot restart the connection to the LUNs, which is
>> seemingly its method of dealing with "hiccups".
>>
>> 2. The hardware queue depth of the LSI-9286CV-8eCC cards is 960 (with
>> 256 per array), so the default queue depth of 64 in LIO shouldn't be
>> killing it.
>
> Using scsi-mq with LSI LLD code would help here, as Scsi_Host->host_lock
> and request_queue->queue_lock contention with many LUNs per Scsi_Host
> also increases I/O latency.
>
>> Unfortunately, there is some multiplication of that number, as we are
>> running 4 IQNs per host, so LIO is likely generating 256 in-flight
>> commands to the backend storage, spread across 4 arrays... still well
>> under the 960 queue depth the card is supposed to be capable of
>> handling.  Because the IQNs/LUNs are under the control of the Pacemaker
>> cluster manager, setup of the IQNs/LUNs happens immediately prior to the
>> connections becoming active, so changing the value of
>> /sys/kernel/target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth
>> manually before the connection becomes active is not possible.  To
>> complicate the situation somewhat, the OCF/Heartbeat/iSCSITarget
>> "resource agent" (really a standardized script that Pacemaker uses to
>> control LIO IQNs) doesn't have the capability built-in to modify the
>> "attribs" when starting a target.  Yes, I could customize/extend this
>> resource agent in our environment, but to do so, deviating from the
>> standard source code, when the hardware queue is already significantly
>> higher than the limit set in LIO, seems unnecessary.  We have limited
>> the queue depth on the ESX side, but I know we were doing that with an
>> eye to the 256/960 queue depth of the LSI controller, so it's quite
>> possible that it is set higher than the 64 default of LIO.  I'll look
>> into that.
>>
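
The multiplication described above can be checked with a few lines of shell.
This is only a rough sketch: the helper name inflight is made up for
illustration, and it assumes one iSCSI session per IQN, which the thread does
not state (more ESX hosts logging in to the same IQN would raise the total).

```shell
#!/bin/sh
# Worst-case in-flight commands reaching the backend:
#   endpoints (IQNs) * sessions per endpoint * cmdsn depth per session
# "inflight" is a hypothetical helper, not part of LIO or any tool.
inflight() {
    echo $(( $1 * $2 * $3 ))
}

# 4 IQNs, 1 session each (assumed), LIO default depth of 64
inflight 4 1 64    # prints 256

# Same layout with the depth reduced to 16
inflight 4 1 16    # prints 64
```

Both totals are below the 960 quoted for the controller, and each array sees
at most the per-session depth against its 256 limit.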
>
> The iscsi_target_mod default_cmdsn_depth=64 per IQN/TPGT endpoint
> assumes a small number of 10 Gb/sec hosts.  Based on your hardware
> configuration, it sounds like it needs to be much smaller.
>
> If your H/A scripts are not currently capable of setting
> default_cmdsn_depth, then you'll need to either hard-code it to a
> smaller value in iscsi_target_mod, or set default_cmdsn_depth + NodeACL
> cmdsn_depth via configfs before LUN exports are made active with:
>
>     echo 1 > ../target/iscsi/$WWN/$TPGT/enable
>
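
A minimal sketch of the configfs sequence described above, for anyone wanting
to script it in a resource agent.  Several things here are assumptions and not
taken from the thread: the function name set_cmdsn_depth, the CONFIGFS_ISCSI
variable, the /sys/kernel/config mount point, and the TPG attribute being
named "enable".  Check them against the running kernel before relying on it.

```shell
#!/bin/sh
# Sketch: lower the iSCSI command window before a TPG is made active.
# CONFIGFS_ISCSI and set_cmdsn_depth are illustrative names (assumptions).
CONFIGFS_ISCSI="${CONFIGFS_ISCSI:-/sys/kernel/config/target/iscsi}"

# Usage: set_cmdsn_depth IQN TPGT DEPTH
set_cmdsn_depth() {
    iqn=$1; tpgt=$2; depth=$3
    tpg="$CONFIGFS_ISCSI/$iqn/$tpgt"

    # TPG-wide default depth.
    echo "$depth" > "$tpg/attrib/default_cmdsn_depth" || return 1

    # Each existing NodeACL carries its own cmdsn_depth, so set those too.
    for acl in "$tpg"/acls/*/; do
        [ -d "$acl" ] || continue        # no ACLs defined: glob did not match
        echo "$depth" > "${acl}cmdsn_depth" || return 1
    done

    # Only after the depths are in place, make the LUN exports active.
    echo 1 > "$tpg/enable"
}

# Example call (hypothetical IQN):
#   set_cmdsn_depth iqn.2015-08.com.example:array1 tpgt_1 16
```

The ordering is the point: both depth values are written first and the enable
write comes last, so no initiator can log in with the old depth.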
>> 3. We disabled VAAI long ago, as it certainly did exacerbate the problem.
>>
>> 4. We are only using a single LUN per IQN.  We are, however, using 4
>> IQNs per server, and two IPs (different subnets) per IQN.  We have
>> ensured that VMware is not load balancing between the different paths,
>> only using the non-active path if the primary path happens to become
>> unavailable...  This was done this way because we wanted to be able to
>> migrate individual arrays between cluster nodes, without having to move
>> them all at once.  This precludes using a single IQN with multiple
>> associated LUNs.
>>
>>
>> I believe, while tuning the system to its best should stop timeouts from
>> happening under theoretical, ideal conditions, that under real world
>> conditions, when drives fail, or other "hiccups" happen, LIO is unable
>> to allow ESX to recover from such events without stopping and
>> restarting, which is why I've asked if it's possible to allow LIO to
>> operate with "an exception to the strict rules of the SCSI SPEC".  It's
>> not about handling it when things are working at their optimum, it's
>> about how the connections are handled when something goes wrong.
>>
>
> I don't think allowing LIO to submit new backend I/Os when previous ones
> haven't yet completed is going to solve this issue.
>
> If a backend I/O request does not complete within a fixed amount of time
> defined by fabric host timeouts, sending yet more backend I/O will not
> address the underlying problem of why the backend is not able to
> complete outstanding I/Os in a timely fashion.
>
> --nab
>