On Wed, 2015-08-19 at 06:12 +0000, Steve Beaudry wrote:
> Thanks for the tips Nicholas,
>
> We've already been down the road of improving performance based on what
> you've mentioned, at least nearly everything...
>
> 1. The backend storage consists of arrays of dedicated disks, connected
> through close to top-of-the-line LSI RAID cards, battery-backed with
> write-back caching, and including 400GB SSD drives doing read
> acceleration for "hot" data. These arrays are replicated using DRBD
> across two separate hosts. Read-balancing is enabled in DRBD, so both
> hosts are used when reading data (typically reads are striped across
> 8-10 disks). Under "normal" circumstances the backend storage is very
> fast. Unfortunately, things happen... Seagate drives are failing at a
> ridiculous rate (a separate issue with Seagate that they are
> addressing). When a drive on either host fails, it can cause a timeout
> significantly longer than "normal". We've also seen other causes of
> occasional timeouts, and the end result is that, through a sequence of
> events, a small timeout relating to a hardware RAID controller is
> causing entire VMware datacenters to hang, because the ESX server
> cannot restart the connection to the LUNs, which is seemingly its
> method of dealing with "hiccups".
>
> 2. The hardware queue depth of the LSI-9286CV-8eCC cards is 960 (with
> 256 per array), so the default LIO queue depth of 64 shouldn't be
> killing it.

Using scsi-mq with LSI LLD code would help here, as Scsi_Host->host_lock
and request_queue->queue_lock contention with many LUNs per Scsi_Host
also increases I/O latency.

> Unfortunately, there is some multiplication of that number, as we are
> running 4 IQNs per host, so LIO is likely generating 256 in-flight
> commands to the backend storage, spread across 4 arrays... still well
> under the 960 queue depth the card is supposed to be capable of
> handling. Because the IQNs/LUNs are under the control of the Pacemaker
> cluster manager, setup of the IQNs/LUNs happens immediately prior to
> the connections becoming active, so changing the value of
> /sys/kernel/config/target/iscsi/$IQN/tpgt_$TPGT/attrib/default_cmdsn_depth
> manually before the connection becomes active is not possible. To
> complicate the situation somewhat, the OCF/Heartbeat/iSCSITarget
> "resource agent" (really a standardized script that Pacemaker uses to
> control LIO IQNs) doesn't have the capability built in to modify the
> "attribs" when starting a target. Yes, I could customize/extend this
> resource agent in our environment, but deviating from the standard
> source code, when the hardware queue is already significantly higher
> than the limit set in LIO, seems unnecessary. We have limited the queue
> depth on the ESX side, but I know we were doing that with an eye to the
> 256/960 queue depth of the LSI controller, so it's quite possible that
> it is set higher than the 64 default of LIO. I'll look into that.

The iscsi_target_mod default_cmdsn_depth=64 per IQN/TPGT endpoint
assumes a small number of 10 Gb/sec hosts. Based on your hardware
configuration, it sounds like that depth needs to be much smaller.

If your H/A scripts are not currently capable of setting
default_cmdsn_depth, then you'll need to either hard-code it to a
smaller value in iscsi_target_mod, or set default_cmdsn_depth + NodeACL
cmdsn_depth via configfs before LUN exports are made active with:

  echo 1 > /sys/kernel/config/target/iscsi/$IQN/tpgt_$TPGT/enable
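
A rough sketch of that ordering, assuming configfs is mounted at
/sys/kernel/config, a single tpgt_1 per endpoint, and placeholder IQNs
for the target and the ESX initiator:

  #!/bin/sh
  # Sketch only: set a smaller CmdSN depth before the endpoint goes live.
  TPG=/sys/kernel/config/target/iscsi/iqn.2003-01.org.example:array0/tpgt_1

  # TPG-wide default, used for demo-mode (generate_node_acls=1) sessions:
  echo 32 > $TPG/attrib/default_cmdsn_depth

  # Per-initiator NodeACL depth, used when explicit ACLs are configured:
  echo 32 > $TPG/acls/iqn.1998-01.com.vmware:esx-host0/cmdsn_depth

  # Only after the depths are set, make the LUN exports active:
  echo 1 > $TPG/enable

That is the ordering your resource agent would need to implement:
attributes first, then the enable.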
> 3. We disabled VAAI long ago, as it certainly did exacerbate the
> problem.
>
> 4. We are only using a single LUN per IQN. We are, however, using 4
> IQNs per server, and two IPs (different subnets) per IQN. We have
> ensured that VMware is not load balancing between the different paths,
> only using the non-active path if the primary path happens to become
> unavailable... We did it this way because we wanted to be able to
> migrate individual arrays between cluster nodes without having to move
> them all at once, which precludes using a single IQN with multiple
> associated LUNs.
>
> I believe that while tuning the system should stop timeouts from
> happening under theoretical, ideal conditions, under real-world
> conditions, when drives fail or other "hiccups" happen, LIO is unable
> to let ESX recover from such events without stopping and restarting.
> That is why I've asked whether it's possible to allow LIO to operate
> with "an exception to the strict rules of the SCSI spec". It's not
> about handling things when they are working at their optimum; it's
> about how the connections are handled when something goes wrong.

I don't think allowing LIO to submit new backend I/Os when previous
ones haven't yet completed is going to solve this issue. If a backend
I/O request does not complete within a fixed amount of time defined by
fabric host timeouts, sending yet more backend I/O will not address the
underlying problem of why the backend is not able to complete
outstanding I/Os in a timely fashion.
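
As for checking the queue depth limit on the ESX side mentioned above:
on ESXi 5.x the software iSCSI per-LUN depth is a module parameter on
iscsi_vmk, so something along these lines should show whether it sits
above LIO's default of 64. The parameter name is taken from VMware's
documentation, so verify it against your build, and note the set only
takes effect after a reboot:

  # Show the current software iSCSI initiator module parameters:
  esxcli system module parameters list -m iscsi_vmk | grep LunQDepth

  # Cap the per-LUN queue depth to match the target side:
  esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=64

--nab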