RE: LIO iscsi connections issues with ESXi


 



Thanks for the CC.  I'm snipping the relevant parts, just to highlight where I know our setup is somewhat different.

I should also mention that we discovered another issue (unrelated to LIO) and have since resolved it.  It may be mere coincidence at this point, but since then we have yet to see the kind of timeouts we were seeing when I sent the original message.

We have 8 VMware ESX hosts, all of which connect to our in-house SAN built on LIO, DRBD, Pacemaker, etc.  Four of these ESX hosts are slightly newer than the other four.  The older four had Intel gigabit NICs connecting to the iSCSI storage networks, whereas the newer hosts had Intel 10Gb NICs.  We discovered that the older 1Gbps NICs were experiencing tx-queue hangs and were being reset.  Upon further investigation, we found that while these NICs were on the certified Hardware Compatibility List when they were installed (VMware 4.x), they have since been removed from it.  We replaced them with NICs matching the 10Gb NICs in the newer hardware, and we haven't experienced the problem since.  We only did this upgrade a week ago, but we were originally seeing the problem nearly every night when backup jobs ran, often multiple times in one night.  In a week we haven't seen it once, so the two could well be related (fingers crossed).

I'm going to respond to the points that you've made about LIO for the sake of completeness, in case it helps someone else.  I'm going to -SNIP- everything except the relevant points.

--SNIP--

	1) The backend storage is not fast enough to keep up with the workload.

	This can happen if the backend is not completing I/Os before ESX's internal SCSI timeout fires.  With ESX v5.x, the SCSI command timeout is 5000 ms (5 seconds), and IIRC for iSCSI can't be changed.

SB - I believe this to be true.  5 seconds is a hard-coded default timeout in ESX.

	Based upon your log above, there is a mix of ABORT_TASKs for commands that have already been completed, but acknowledged / not acknowledged.

SB - I agree that this is the case.
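
For anyone else trying to reason about point 1, here is a rough back-of-the-envelope sketch of why a slow backend plus a deep queue can trip the 5 second timeout.  It assumes the backend completes commands at a steady rate and that the last queued command has to wait for everything ahead of it, which is a simplification.  The function names and the example numbers are mine, not measurements from our SAN.

```python
ESX_SCSI_TIMEOUT_S = 5.0  # the 5000 ms command timeout quoted above


def worst_case_wait_s(outstanding_cmds: int, backend_iops: float) -> float:
    """Approximate time for the last queued command to complete.

    Simplifying assumption: the backend completes commands at a steady
    rate of backend_iops and the queue is serviced in order.
    """
    if backend_iops <= 0:
        raise ValueError("backend_iops must be positive")
    return outstanding_cmds / backend_iops


def exceeds_timeout(outstanding_cmds: int, backend_iops: float,
                    timeout_s: float = ESX_SCSI_TIMEOUT_S) -> bool:
    """True if the estimated wait is longer than the initiator timeout."""
    return worst_case_wait_s(outstanding_cmds, backend_iops) > timeout_s


if __name__ == "__main__":
    # Hypothetical: backend stalls to 10 completions/sec during a backup window.
    for depth in (64, 16, 8):
        wait = worst_case_wait_s(depth, 10.0)
        print(f"depth={depth:3d} wait={wait:.1f}s over_timeout={exceeds_timeout(depth, 10.0)}")
```

With those made-up numbers, 64 outstanding commands would take longer than the timeout to drain, while 16 or 8 would not, which is the reasoning behind lowering the queue depth in point 2.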

	2) The default_cmdsn_depth per iscsi endpoint is too large.

	By default starting with v3.12 code, the default_cmdsn_depth is 64 (i.e. the number of outstanding commands that can be in flight at a given time per session).  This is configured on a per TargetName+TPGT context basis, or a per NodeACL context basis.

	I'd recommend trying a lower default_cmdsn_depth (say 16 or 8 or lower), in order to limit the number of outstanding commands ESX can keep in flight at a given time.  Note that you'll need to restart the session in order for the changes to take effect.

SB - I'll look into this.  I believe we turned this down in the iSCSI settings on the ESX hosts a while ago, but I'll verify.  Thanks.
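
In case it helps someone else, the target-side change suggested above can be sketched in Python as a write to the LIO configfs tree.  This is a minimal sketch, not a tested procedure: the attribute locations (attrib/default_cmdsn_depth under the TPG, and cmdsn_depth under a NodeACL) are as I understand the configfs layout, so please verify them on your own kernel.  The helper names and the IQNs in the usage comment are hypothetical.

```python
from pathlib import Path

# Assumed location of the LIO iSCSI fabric in configfs; verify on your system.
CONFIGFS_ISCSI = Path("/sys/kernel/config/target/iscsi")


def cmdsn_depth_path(target_iqn, tpgt, initiator_iqn=None, root=CONFIGFS_ISCSI):
    """Return the attribute file for the TPG default or for one NodeACL."""
    tpg = Path(root) / target_iqn / f"tpgt_{tpgt}"
    if initiator_iqn is None:
        return tpg / "attrib" / "default_cmdsn_depth"
    return tpg / "acls" / initiator_iqn / "cmdsn_depth"


def set_cmdsn_depth(depth, target_iqn, tpgt, initiator_iqn=None, root=CONFIGFS_ISCSI):
    """Write a new depth and return the previous value.

    Existing sessions keep the old depth until they are restarted.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    path = cmdsn_depth_path(target_iqn, tpgt, initiator_iqn, root)
    previous = int(path.read_text().strip())
    path.write_text(f"{depth}\n")
    return previous


# Hypothetical usage (run as root on the target):
# set_cmdsn_depth(16, "iqn.2003-01.org.linux-iscsi.example:san1", 1)
# set_cmdsn_depth(8, "iqn.2003-01.org.linux-iscsi.example:san1", 1,
#                 initiator_iqn="iqn.1998-01.com.vmware:esx-example")
```

The root parameter is only there so the function can be tried against a scratch directory before touching the real configfs tree.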

	3) There are too many LUNs on a single target export, causing the ESX initiator to hit internal false positive timeouts.

	ESX has a known issue that if too many LUNs are exported on a single TargetName+TPGT endpoint (say > 8 LUNs per endpoint), it will begin to hit false positive timeouts internally, due to scheduling fairness issues within the ESX SCSI host subsystem.

	For 10 Gb/sec ports, I'd recommend keeping <= 4 LUNs per target endpoint in order to avoid these types of false positives.  This can also depend on how many total TargetName+TPGT endpoints have been configured.

SB - This shouldn't be an issue for us, as we're using a single LUN per target endpoint, although we do have 8 targets running on each server concurrently.
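
For others who want to check point 3 on their own targets, here is a small sketch that counts LUNs per TargetName+TPGT endpoint by walking the LIO configfs tree and flags any endpoint above a limit.  It assumes LUNs appear as lun/lun_N directories under each tpgt_N directory, which is my understanding of the layout and worth confirming locally.  The function names are hypothetical, and the default limit of 4 is just the figure recommended above for 10 Gb/sec ports.

```python
from pathlib import Path


def luns_per_endpoint(root="/sys/kernel/config/target/iscsi"):
    """Map 'TargetName,tpgt_N' to the number of LUNs exported on it."""
    counts = {}
    for tpg in sorted(Path(root).glob("*/tpgt_*")):
        lun_dir = tpg / "lun"
        if lun_dir.is_dir():
            n = sum(1 for p in lun_dir.glob("lun_*") if p.is_dir())
        else:
            n = 0
        counts[f"{tpg.parent.name},{tpg.name}"] = n
    return counts


def endpoints_over_limit(counts, limit=4):
    """Return the endpoints exporting more than `limit` LUNs."""
    return [name for name, n in counts.items() if n > limit]


if __name__ == "__main__":
    counts = luns_per_endpoint()
    for name, n in counts.items():
        print(f"{name}: {n} LUN(s)")
    for name in endpoints_over_limit(counts):
        print(f"WARNING: {name} exports more than 4 LUNs")
```

In a layout like ours (one LUN per endpoint, 8 targets per server) this should report 8 endpoints with 1 LUN each and no warnings.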

---SNIP---

Thank you very much for the feedback thus far.

Cheers,

...Steve...

Stephen Beaudry, Manager
Server, Network and Telecom Infrastructures | Royal Roads University
T 250.391.2600 ext. 4149 
2005 Sooke Road, Victoria, BC  Canada  V9B 5Y2 | royalroads.ca
 
LIFE.CHANGING






