On Wed, 2014-06-04 at 17:22 -0700, Jun Wu wrote:
> Is there a design limit for the number of target drives that we
> should not cross? Is 10 a reasonable number? We did notice that a
> lower number of targets has fewer problems in our testing.
>

It completely depends on the fabric, initiator, and backend storage.

For example, some initiators (like ESX iSCSI on 1 Gb/sec ethernet)
have a problem with more than a handful of LUNs per session, which can
result in false-positive timeouts on the initiator side under heavy
I/O loads due to fairness issues + I/Os not being sent out of the
initiator fast enough.

Other initiators, like Qlogic FC, are able to run 256 LUNs in a single
session + endpoint without issues.

> Are there any additional tests that we can do to narrow down the
> problem? For example, try different IO types: random vs sequential,
> read vs write. Would that help?
>

If the issue really is related to the number of outstanding I/Os on
your network, one easy thing to do is reduce the default queue_depth=3
to queue_depth=1 for each LUN on the initiator side and see if that
has any effect.

I don't recall where these values are in /sys for fcoe, but they are
easy to find using 'find /sys -name queue_depth'.

Go ahead and set each of these for your fcoe initiator's LUNs to
queue_depth=1 + retest. Also note that these values are not persistent
across a restart.

> Nab,
> We cannot change the connection between the servers. They are bare
> metal cloud servers that we don't have direct access to.
>

That's a shame, as it would certainly help isolate individual
networking components.

--nab
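P.S. A quick sketch of the queue_depth change, assuming the standard
writable queue_depth attribute under each SCSI device in /sys (the
exact paths may differ on your kernel, so sanity-check the find output
before writing to it):

  # show current per-LUN queue depths
  for qd in $(find /sys -name queue_depth); do
      echo "$qd: $(cat "$qd")"
  done

  # drop every LUN to queue_depth=1 (not persistent across reboot)
  for qd in $(find /sys -name queue_depth); do
      echo 1 > "$qd"
  done

Re-run the same I/O workload afterwards and compare against the
timeouts you saw with the default depth.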