On Wed, 2015-08-19 at 06:12 +0000, Steve Beaudry wrote:
> Thanks for the tips Nicholas,
>
> We've already been down the road of improving performance based on what
> you've mentioned, at least nearly everything...
>
> 1. The backend storage consists of arrays of dedicated disks, connected
> through close to top-of-the-line LSI RAID cards, battery-backed with
> write-back caching, and including 400GB SSD drives doing read
> acceleration for "hot" data. These arrays are replicated using DRBD
> across two separate hosts. Read-balancing is enabled in DRBD, so both
> hosts are used when reading data (typically reads are striped across
> 8-10 disks). Under "normal" circumstances the backend storage is very
> fast. Unfortunately, things happen... Seagate drives are failing at a
> ridiculous rate (a separate issue with Seagate that they are
> addressing). When a drive on either host fails, it can cause a timeout
> significantly longer than "normal". We've also seen other causes of
> occasional timeouts, and the end result is that, through a sequence of
> events, a small timeout relating to a hardware RAID controller is
> causing entire VMware datacenters to hang, because the ESX server
> cannot restart the connection to the LUNs, which is seemingly its
> method of dealing with "hiccups".
>
> 2. The hardware queue depth of the LSI-9286CV-8eCC cards is 960 (with
> 256 per array), so the default LIO queue depth of 64 shouldn't be
> killing it.

Using scsi-mq with LSI LLD code would help here, as Scsi_Host->host_lock
and request_queue->queue_lock contention with many LUNs per Scsi_Host
also increases I/O latency.

> Unfortunately, there is some multiplication of that number, as we are
> running 4 IQNs per host, so LIO is likely generating 256 in-flight
> commands to the backend storage, spread across 4 arrays... still well
> under the 960 queue depth the card is supposed to be capable of
> handling. Because the IQNs/LUNs are under the control of the Pacemaker
> cluster manager, setup of the IQNs/LUNs happens immediately prior to
> the connections becoming active, so changing the value of
> /sys/kernel/config/target/iscsi/$IQN/tpgt_$TPGT/attrib/default_cmdsn_depth
> manually before the connection becomes active is not possible. To
> complicate the situation somewhat, the OCF/Heartbeat/iSCSITarget
> "resource agent" (really a standardized script that Pacemaker uses to
> control LIO IQNs) doesn't have the capability built in to modify the
> "attribs" when starting a target. Yes, I could customize/extend this
> resource agent in our environment, but deviating from the standard
> source code, when the hardware queue is already significantly higher
> than the limit set in LIO, seems unnecessary. We have limited the queue
> depth on the ESX side, but I know we were doing that with an eye to the
> 256/960 queue depth of the LSI controller, so it's quite possible that
> it is set higher than the 64 default of LIO. I'll look into that.

The iscsi_target_mod default_cmdsn_depth=64 per IQN/TPGT endpoint
assumes a small number of 10 Gb/sec hosts. Based on your hardware
configuration, it sounds like that depth needs to be much smaller.

If your H/A scripts are not currently capable of setting
default_cmdsn_depth, then you'll need to either hard-code it to a
smaller value in iscsi_target_mod, or set default_cmdsn_depth + NodeACL
cmdsn_depth via configfs before LUN exports are made active with:

  echo 1 > /sys/kernel/config/target/iscsi/$IQN/tpgt_$TPGT/enable
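
A rough sketch of that ordering, assuming configfs is mounted at
/sys/kernel/config, a single tpgt_1 per endpoint, and placeholder IQNs
for the target and the ESX initiator:

  #!/bin/sh
  # Sketch only: set a smaller CmdSN depth before the endpoint goes live.
  TPG=/sys/kernel/config/target/iscsi/iqn.2003-01.org.example:array0/tpgt_1

  # TPG-wide default, used for demo-mode (generate_node_acls=1) sessions:
  echo 32 > $TPG/attrib/default_cmdsn_depth

  # Per-initiator NodeACL depth, used when explicit ACLs are configured:
  echo 32 > $TPG/acls/iqn.1998-01.com.vmware:esx-host0/cmdsn_depth

  # Only after the depths are set, make the LUN exports active:
  echo 1 > $TPG/enable

That is the ordering your resource agent would need to implement:
attributes first, then the enable.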
> 3. We disabled VAAI long ago, as it certainly did exacerbate the
> problem.
>
> 4. We are only using a single LUN per IQN. We are, however, using 4
> IQNs per server, and two IPs (different subnets) per IQN. We have
> ensured that VMware is not load balancing between the different paths,
> only using the non-active path if the primary path happens to become
> unavailable... We did it this way because we wanted to be able to
> migrate individual arrays between cluster nodes without having to move
> them all at once, which precludes using a single IQN with multiple
> associated LUNs.
>
> I believe that while tuning the system should stop timeouts from
> happening under theoretical, ideal conditions, under real-world
> conditions, when drives fail or other "hiccups" happen, LIO is unable
> to let ESX recover from such events without stopping and restarting.
> That is why I've asked whether it's possible to allow LIO to operate
> with "an exception to the strict rules of the SCSI spec". It's not
> about handling things when they are working at their optimum; it's
> about how the connections are handled when something goes wrong.

I don't think allowing LIO to submit new backend I/Os when previous
ones haven't yet completed is going to solve this issue. If a backend
I/O request does not complete within a fixed amount of time defined by
fabric host timeouts, sending yet more backend I/O will not address the
underlying problem of why the backend is not able to complete
outstanding I/Os in a timely fashion.
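
As for checking the queue depth limit on the ESX side mentioned above:
on ESXi 5.x the software iSCSI per-LUN depth is a module parameter on
iscsi_vmk, so something along these lines should show whether it sits
above LIO's default of 64. The parameter name is taken from VMware's
documentation, so verify it against your build, and note the set only
takes effect after a reboot:

  # Show the current software iSCSI initiator module parameters:
  esxcli system module parameters list -m iscsi_vmk | grep LunQDepth

  # Cap the per-LUN queue depth to match the target side:
  esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=64

--nab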