Hi all,

Thank you for sharing all the interesting tips and ideas. I agree with Steve that there are two
different issues. It makes sense to reduce default_cmdsn_depth if the backend storage is overloaded
and cannot handle more outstanding I/O in a timely manner. However, this doesn't help in the case of
temporary backend outages, such as a RAID disk or Ceph node failure, where we know we will
definitely exceed the 5-second timeout and want the sessions to be reset. ESXi recovers quite well
from APD conditions, but this does not seem to be that situation.

Steve, I was testing the same Pacemaker+DRBD setup as you back in 2011 and decided to rewrite the
target resource agent from scratch; the original one was too unreliable and slow. (Sorry, I cannot
make it publicly available.) Note that I never saw ABORT_TASKs when running this setup on our Dell
hardware.

Martin

On 19.8.2015 at 10:22, Steve Beaudry wrote:
> Thanks Nicholas,
>
> I'll modify the resource agent script so that the default_cmdsn_depth can be set, and reduce
> the value to 16, based on your recommendation, and see what impact it has.
>
> I do still believe that we're talking about two different problems, one being performance and
> outstanding IOs timing out, while the other being a seeming incompatibility between LIO and ESX
> with regard to handling sessions when VMWare decides to restart a session, which it does for a
> number of reasons (really, in response to any number of SCSI errors that pop up).
>
> ...Steve...
>
>
> -------- Original message --------
> From: "Nicholas A. Bellinger"
> Date: 08-19-2015 12:04 AM (GMT-08:00)
> To: Steve Beaudry
> Cc: Alex Gorbachev, Martin Svec, target-devel@xxxxxxxxxxxxxxx
> Subject: Re: ESXi + LIO + Ceph RBD problem
>
> On Wed, 2015-08-19 at 06:12 +0000, Steve Beaudry wrote:
> > Thanks for the tips Nicholas,
> >
> > We've already been down the road of improving the performance based on what
> > you've mentioned, at least nearly everything...
> >
> > 1. The backend storage consists of arrays of dedicated disks, connected through
> > near-top-end RAID cards from LSI, battery-backed, with write-back caching, and
> > including 400GB SSD drives doing read acceleration for "hot" data. These
> > arrays are replicated using DRBD across two separate hosts. Read-balancing
> > is enabled in DRBD, so both hosts are used when reading data (typically reads
> > are striped across 8-10 disks)... Under "normal" circumstances, the
> > backend storage is very fast. Unfortunately, things happen... Seagate drives
> > are failing at a ridiculous rate (a separate issue with Seagate that they are
> > addressing). When a drive on either host fails, it can cause a timeout
> > significantly longer than "normal". We've also seen other reasons for
> > timeouts occasionally occurring, and the end result is that, because of a
> > sequence of events, a small timeout relating to a hardware RAID controller is
> > causing entire VMWare datacenters to hang, because the ESX server cannot restart
> > the connection to the LUNs, which is seemingly their method of dealing with
> > "hiccups".
> >
> > 2. The hardware queue depth of the LSI-9286CV-8eCC cards is 960 (with 256 per
> > array), so the default LIO queue depth of 64 shouldn't be killing it.
>
> Using scsi-mq with LSI LLD code would help here, as Scsi_Host->host_lock
> and request_queue->queue_lock contention with many LUNs per Scsi_Host
> also increases I/O latency.
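
For what it's worth, a quick way to check whether the SCSI midlayer on the target host is already
running in blk-mq mode is to read the scsi_mod module parameter (this assumes a kernel new enough
to carry scsi-mq, roughly 3.17 or later; the exact parameter spelling may differ on vendor kernels):

    # Prints Y or N depending on whether scsi-mq is in use
    cat /sys/module/scsi_mod/parameters/use_blk_mq

    # To switch it on at boot, add the following to the kernel command line
    # (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub) and reboot:
    #   scsi_mod.use_blk_mq=Y

I haven't measured how much this helps with the LSI cards, so treat it only as a starting point.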
>
> > Unfortunately, there is some multiplication of that number, as we are running
> > 4 IQNs per host, so LIO is likely generating 256 in-flight commands to the
> > backend storage, spread across 4 arrays... still well under the 960 queue
> > depth the card is supposed to be capable of handling. Because the IQNs/LUNs
> > are under the control of the Pacemaker cluster manager, setup of the IQNs/LUNs
> > happens immediately prior to the connections becoming active, so changing the
> > value of /sys/kernel/target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth
> > manually before the connection becomes active is not possible. To complicate
> > the situation somewhat, the OCF/Heartbeat/iSCSITarget "resource agent" (really
> > a standardized script that Pacemaker uses to control LIO IQNs) doesn't have
> > the capability built in to modify the "attribs" when starting a target. Yes,
> > I could customize/extend this resource agent in our environment, but to do so,
> > deviating from the standard source code, when the hardware queue is already
> > significantly higher than the limit set in LIO, seems unnecessary. We have
> > limited the queue depth on the ESX side, but I know we were doing that with an
> > eye to the 256/960 queue depth of the LSI controller, so it's quite possible
> > that it is set higher than the 64 default of LIO. I'll look into that.
> >
> The iscsi_target_mod default_cmdsn_depth=64 per IQN/TPGT endpoint
> assumes a small number of 10 Gb/sec hosts. Based on your hardware
> configuration, it sounds like it needs to be much smaller.
>
> If your H/A scripts are not currently capable of setting
> default_cmdsn_depth, then you'll need to either hard-code it to a
> smaller value in iscsi_target_mod, or set default_cmdsn_depth + NodeACL
> cmdsn_depth via configfs before LUN exports are made active with:
>
> echo 1 > ../target/iscsi/$WWN/$TPGT/enabled
>
> > 3. We disabled VAAI long ago, as it certainly did exacerbate the problem.
> >
> > 4. We are only using a single LUN per IQN. We are, however, using 4 IQNs per
> > server, and two IPs (different subnets) per IQN. We have ensured that VMWare
> > is not load balancing between the different paths, only using the non-active
> > path if the primary path happens to become unavailable... This was done this
> > way because we wanted to be able to migrate individual arrays between cluster
> > nodes without having to move them all at once. This precludes using a single
> > IQN with multiple associated LUNs.
> >
> >
> > I believe that, while tuning the system to its best should stop timeouts from
> > happening under theoretical, ideal conditions, under real-world conditions,
> > when drives fail or other "hiccups" happen, LIO is unable to allow ESX to
> > recover from such events without stopping and restarting, which is why I've
> > asked if it's possible to allow LIO to operate with "an exception to the
> > strict rules of the SCSI SPEC". It's not about handling it when things are
> > working at their optimum, it's about how the connections are handled when
> > something goes wrong.
> >
>
> I don't think allowing LIO to submit new backend I/Os when previous ones
> haven't yet completed is going to solve this issue.
>
> If a backend I/O request does not complete within a fixed amount of time
> defined by fabric host timeouts, sending yet more backend I/O will not
> address the underlying problem of why the backend is not able to
> complete outstanding I/Os in a timely fashion.
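
To make the configfs suggestion above concrete: a minimal sketch of the order of operations could
look like the following. The IQNs, TPG number and depth value are only placeholders, the paths
assume configfs is mounted at /sys/kernel/config, and the enable attribute may be spelled "enable"
rather than "enabled" depending on the kernel version.

    # Placeholders -- substitute the real target/initiator IQNs and TPG number.
    TARGET_IQN=iqn.2003-01.org.linux-iscsi.example:target0
    INITIATOR_IQN=iqn.1998-01.com.vmware:esx-host0
    TPG=/sys/kernel/config/target/iscsi/${TARGET_IQN}/tpgt_1

    # Lower the TPG-wide default before any initiator can log in.
    echo 16 > ${TPG}/attrib/default_cmdsn_depth

    # Optionally pin the depth per initiator on its NodeACL as well
    # (the ACL must already exist under acls/).
    echo 16 > ${TPG}/acls/${INITIATOR_IQN}/cmdsn_depth

    # Only then make the LUN exports active.
    echo 1 > ${TPG}/enable

The point is simply that both depths have to be written before the TPG is enabled, i.e. in the
resource agent's start action rather than by hand afterwards.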
>
> --nab