Follow-up to this... Thinking about queue depths reminded me of something somewhat relevant to our situation, though likely unrelated to LIO specifically. Still, I wanted to mention it for completeness, since I said we'd limited queue depth in VMware ESX... We normally only see problems while our backups are running. Our backups are not limited by the VMware ESX queue depth, because we use a backup product that connects to the iSCSI LUNs directly from a different host (coordinating with VMware and taking snapshots to avoid concurrent-access problems). While we've limited the queue depth of the ESX servers, we've never found a way to limit the queue depth of the backup software. Taking a snapshot deals with the "concurrent access to a file" problem, but it doesn't change the fact that our backup server can somewhat swamp the LUN with requests. VMware's documentation notes that higher-than-normal latency is common during this type of activity, and triggering timeouts is quite possibly normal... Unfortunately, again, VMware seems to see resetting the LUN connection as a perfectly normal way of dealing with a "hiccup", which is where my original request came in...

-----Original Message-----
From: target-devel-owner@xxxxxxxxxxxxxxx [mailto:target-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Steve Beaudry
Sent: August 18, 2015 11:13 PM
To: Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx>
Cc: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Martin Svec <martin.svec@xxxxxxxx>; target-devel@xxxxxxxxxxxxxxx
Subject: RE: ESXi + LIO + Ceph RBD problem

Thanks for the tips, Nicholas.

We've already been down the road of improving performance along the lines you mention, at least nearly all of it...

1. The backend storage consists of arrays of dedicated disks, connected through more or less top-of-the-line RAID cards from LSI with battery-backed write-back caching, plus 400GB SSD drives doing read acceleration for "hot" data. These arrays are replicated using DRBD across two separate hosts. Read-balancing is enabled in DRBD, so both hosts are used when reading data (typically reads are striped across 8-10 disks)... Under "normal" circumstances the backend storage is very fast. Unfortunately, things happen... Seagate drives are failing at a ridiculous rate (a separate issue that Seagate is addressing). When a drive on either host fails, it can cause a timeout significantly longer than "normal". We've also seen other occasional causes of timeouts, and the end result is that, through a sequence of events, a small timeout at a hardware RAID controller hangs entire VMware datacenters, because the ESX server cannot restart the connection to the LUNs, which is seemingly its method of dealing with "hiccups".

2. The hardware queue depth of the LSI-9286CV-8eCC cards is 960 (256 per array), so LIO's default queue depth of 64 shouldn't be killing it. There is some multiplication of that number, as we run 4 IQNs per host, so LIO is likely generating up to 256 in-flight commands to the backend storage, spread across 4 arrays... still well under the 960 queue depth the card is supposed to be capable of handling.
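As a quick sanity check on that multiplication, the configured windows can be totalled straight out of configfs; a rough sketch (assuming the usual /sys/kernel/config mount) is:

    # Sum the default_cmdsn_depth of every exported iSCSI TPGT on this host,
    # to compare against the RAID controller's advertised queue depth (960 here).
    total=0
    for f in /sys/kernel/config/target/iscsi/*/tpgt_*/attrib/default_cmdsn_depth; do
        depth=$(cat "$f")
        printf '%s -> %s\n' "$f" "$depth"
        total=$((total + depth))
    done
    echo "aggregate default_cmdsn_depth: $total"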
Because the IQNs/LUNs are under the control of the Pacemaker cluster manager, setup of the IQNs/LUNs happens immediately before the connections become active, so manually changing the value of /sys/kernel/target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth before the connection becomes active is not possible. To complicate things somewhat, the OCF/Heartbeat/iSCSITarget "resource agent" (really a standardized script that Pacemaker uses to control LIO IQNs) has no built-in capability to modify the "attribs" when starting a target. Yes, I could customize/extend this resource agent in our environment (the write itself is trivial; a bare sketch is tacked on below my signature), but deviating from the standard source code, when the hardware queue is already significantly deeper than the limit set in LIO, seems unnecessary. We have limited the queue depth on the ESX side, but I know we did that with an eye to the 256/960 queue depth of the LSI controller, so it's quite possible it is set higher than LIO's default of 64. I'll look into that.

3. We disabled VAAI long ago, as it certainly did exacerbate the problem.

4. We are only using a single LUN per IQN. We are, however, using 4 IQNs per server, and two IPs (on different subnets) per IQN. We have ensured that VMware is not load-balancing between the different paths, and only uses the non-active path if the primary path becomes unavailable... We did it this way because we wanted to be able to migrate individual arrays between cluster nodes without having to move them all at once, and that precludes using a single IQN with multiple associated LUNs.

I believe that, while tuning the system as well as possible should stop timeouts under theoretical, ideal conditions, under real-world conditions, when drives fail or other "hiccups" happen, LIO does not let ESX recover from such events without stopping and restarting, which is why I've asked whether it's possible to allow LIO to operate with "an exception to the strict rules of the SCSI SPEC". It's not about handling things when everything is working at its optimum; it's about how the connections are handled when something goes wrong.

As a small aside, you're quite correct that if there is no timeout on the storage backend, there is no problem... We have the same setup running on 4 identical storage hosts, the only difference being that the spinning disks in those hosts are replaced with 400GB Intel DC S3700 SSD drives, and we've never, ever encountered a timeout on those arrays (go figure). I suspect, though, that if we lost a drive and things slowed down because the RAID card suddenly had to read every other drive and reconstruct the missing bits, even the SSD-based hosts might suffer the same "need to restart the target" symptom.

Cheers, and thank you again for the work you've done on this, and for the pointers and help thus far...

  ...Steve...
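P.S. For concreteness, the write the resource agent would have to perform is just a single echo into configfs. A rough sketch, with placeholder IQN/TPG values and assuming the standard /sys/kernel/config mount (it only affects sessions that log in after the change):

    # Placeholder values -- substitute the real target IQN and TPG tag.
    IQN="iqn.2003-01.org.linux-iscsi.storage1:array1"
    TPGT="tpgt_1"

    # Lower the per-session command window; only sessions established
    # after this write pick up the new value.
    echo 16 > /sys/kernel/config/target/iscsi/${IQN}/${TPGT}/attrib/default_cmdsn_depth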
-----Original Message-----
From: Nicholas A. Bellinger [mailto:nab@xxxxxxxxxxxxxxx]
Sent: August 18, 2015 9:25 PM
To: Steve Beaudry <Steve.Beaudry@xxxxxxxxxxxxx>
Cc: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Martin Svec <martin.svec@xxxxxxxx>; target-devel@xxxxxxxxxxxxxxx
Subject: Re: ESXi + LIO + Ceph RBD problem

On Mon, 2015-08-17 at 22:10 +0000, Steve Beaudry wrote:
> Hey Guys,
>
> We're seeing exactly the same behaviour using ESXi + LIO + DRBD,
> using pacemaker/corosync to control the cluster...
>
> Under periods of heavy load (typically during backups), we
> occasionally see warnings in the logs exactly as you've mentioned:
>
> > [ 3052.065353] ABORT_TASK: Found referenced iSCSI task_tag: 801219
> > [ 3052.066370] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> > [ 3082.714529] ABORT_TASK: Found referenced iSCSI task_tag:
> > [ 3082.714532] ABORT_TASK: ref_tag: 801223 already complete, skipping
> > [ 3082.714533] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 801223
> > [ 3082.714536] ABORT_TASK: Found referenced iSCSI task_tag: 801222
> > [ 3082.714540] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 801222
>
> We set up monitoring scripts that watch for these sorts of entries,
> followed by the inevitable LUN RESCAN that ESXi will perform when it
> can't talk to one of its disks:
>
> [261204.802785] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000007
> [261204.805443] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000008
> [261204.806166] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000009
> [261204.809172] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000000a
> etc... for the next 200 or so lines...
>
> The only way we've found to deal with this is to migrate our primary
> storage to the second host in the cluster, unceremoniously killing the
> iSCSI stack on the initial host and starting it on the second host.
> All this is REALLY accomplishing is resetting the connections and
> letting ESXi reconnect.
>
> We're fairly heavily invested in this setup, and my question is: is
> there a way to set a flag somewhere, or tweak a setting or code, to
> allow LIO to violate the strict rules of the SCSI SPEC, so that this
> setup can work? I'm going to HAVE to find a way around this very
> shortly, and I'd really rather the option not be "replace the in-kernel
> iSCSI stack with TGT or SCST", because they allow that sort of thing.
>

To clarify, following the iSCSI spec in the previous Ceph discussion meant that iscsi-target code waits for outstanding in-flight I/Os to complete before allowing a new session reinstatement to retry the same I/Os again. Having LIO 'violate the strict rules' in this context is not going to prevent an ESX host from hitting its internal SCSI timeout when a backend I/O takes longer than 5 seconds to respond under load.

So, if a backend storage configuration can't keep up with forward-facing I/O requirements, you need to:

* Reduce ../target/iscsi/$IQN/$WWN/$TPGT/attrib/default_cmdsn_depth and/or the initiator NodeACL cmdsn_depth

This attribute controls how many I/Os a single iSCSI initiator can keep in flight at a given time. The current default is 64; try 16 and work backwards in powers of 2 from there. Note that any active sessions need to be restarted for changes to this value to take effect.

* Disable VAAI clone and/or zero primitive emulation

Clone operations doing local EXTENDED_COPY emulation can generate significant back-end I/O load, and disabling it for some or all backends might help mask the unbounded I/O latency to ESX. ../target/core/$HBA/$DEV/attrib/emulate_3pc controls reporting of the EXTENDED_COPY feature bit to ESX. You can verify with esxtop that it has been disabled. (A configfs sketch of this write and the NodeACL override follows at the end of this message.)

* Use a single LUN mapping per IQN/TPGT endpoint

The ESX iSCSI initiator has well-known fairness issues that can generate false-positive internal SCSI timeouts under load when multiple LUN mappings exist on the same target IQN/TPGT endpoint.
To avoid hitting these types of ESX false positives, using a single LUN (LUN=0) mapping per target IQN/TPGT endpoint might also help.

But keep in mind that these tunables will only mask the issue of unbounded backend I/O latency. If backend I/O latency is on the order of minutes, rather than just a few seconds over the expected 5-second internal timeout, changing iscsi-target code is still not going to make ESX happy.

--nab
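For reference, the NodeACL and VAAI knobs above correspond to configfs writes along these lines (a sketch only: the target IQN, initiator IQN, HBA, and device names are placeholders, and configfs is assumed to be mounted at /sys/kernel/config):

    # Per-initiator command-window override on the NodeACL (placeholder names);
    # like default_cmdsn_depth, it applies to sessions established after the write.
    echo 16 > /sys/kernel/config/target/iscsi/iqn.2003-01.example:array1/tpgt_1/acls/iqn.1998-01.com.vmware:esx1/cmdsn_depth

    # Stop advertising the EXTENDED_COPY (VAAI clone offload) feature bit for one backstore device.
    echo 0 > /sys/kernel/config/target/core/iblock_0/array1/attrib/emulate_3pc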