Re: [Open-FCoE] System crashes with increased drive count

The queue_depth is 32 by default, so all my previous tests were run with
queue_depth = 32.

The results of my tests today confirm that with a higher queue_depth,
there are more aborts on the initiator side and correspondingly more
"Exchange timer armed : 0 msecs" messages on the target side.
I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there are no aborts
or any other abnormal messages. For 3, there are 15 aborts and
15 "0 msecs" messages. When queue_depth is increased to 6, there are
81 aborts and 81 "0 msecs" messages for the same fio test.
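
For reproducibility, something like the following can be used to set the
per-LUN queue_depth before each fio run. This is only a sketch: the
sdb..sdk device names are placeholders for the ten FCoE LUNs on the
initiator, and the sysfs path is the one Vasu points out in the quoted
reply below.

#!/bin/bash
# Sketch: set the same queue_depth on every FCoE LUN before a fio run.
# The device names below are placeholders; adjust them to your LUNs,
# or discover the sysfs files with: find /sys -name queue_depth
QD=${1:-1}
for dev in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk; do
    f=/sys/block/$dev/device/queue_depth
    [ -w "$f" ] && echo "$QD" > "$f" && echo "$dev -> $(cat "$f")"
done
# Note: these settings are not persistent, so rerun after every reboot.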

There seems to be a one-to-one correspondence between the abort
messages on the initiator side and the "0 msecs" messages on the target
side; the xids are the same (a quick cross-check sketch follows the
excerpts below):

On the target side:
[root@poc2 log]# grep "0 msec" messages
Jun  6 11:56:05 poc2 kernel: [ 4436.866024] host7: xid  88b: Exchange
timer armed : 0 msecs
Jun  6 11:56:05 poc2 kernel: [ 4436.866263] host7: xid  905: Exchange
timer armed : 0 msecs
Jun  6 11:56:05 poc2 kernel: [ 4436.866563] host7: xid  aa4: Exchange
timer armed : 0 msecs
Jun  6 11:56:05 poc2 kernel: [ 4436.866740] host7: xid  be1: Exchange
timer armed : 0 msecs
Jun  6 11:56:05 poc2 kernel: [ 4436.866972] host7: xid  925: Exchange
timer armed : 0 msecs
Jun  6 11:56:05 poc2 kernel: [ 4436.867137] host7: xid  627: Exchange
timer armed : 0 msecs
Jun  6 11:57:27 poc2 kernel: [ 4518.995622] host7: xid  926: Exchange
timer armed : 0 msecs
Jun  6 11:57:27 poc2 kernel: [ 4518.995919] host7: xid  be8: Exchange
timer armed : 0 msecs
Jun  6 11:57:27 poc2 kernel: [ 4518.996231] host7: xid  229: Exchange
timer armed : 0 msecs
Jun  6 11:57:27 poc2 kernel: [ 4518.996500] host7: xid  4cc: Exchange
timer armed : 0 msecs
Jun  6 11:57:27 poc2 kernel: [ 4518.996763] host7: xid  a2a: Exchange
timer armed : 0 msecs
Jun  6 12:14:54 poc2 kernel: [ 5566.776388] host7: xid  34e: Exchange
timer armed : 0 msecs
Jun  6 12:14:54 poc2 kernel: [ 5566.776657] host7: xid  9e2: Exchange
timer armed : 0 msecs
Jun  6 12:14:54 poc2 kernel: [ 5566.776950] host7: xid  964: Exchange
timer armed : 0 msecs
Jun  6 12:14:54 poc2 kernel: [ 5566.777172] host7: xid  a8a: Exchange
timer armed : 0 msecs

On the initiator side:
8890680:Jun  6 11:56:05 poc1 kernel: [ 4899.764511] host7: xid  745:
f_ctl  90008 seq  2
8890681:Jun  6 11:56:05 poc1 kernel: [ 4899.764711] host7: xid  745:
exch: BLS rctl 84 - BLS accept
8890682:Jun  6 11:56:05 poc1 kernel: [ 4899.764753] host7: fcp:
00061e: xid 0745-088b: target abort cmd  passed
8890683:Jun  6 11:56:05 poc1 kernel: [ 4899.764758] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890684:Jun  6 11:56:05 poc1 kernel: [ 4899.764762] host7: xid  ae1:
f_ctl  90008 seq  2
8890685:Jun  6 11:56:05 poc1 kernel: [ 4899.764927] host7: xid  ae1:
exch: BLS rctl 84 - BLS accept
8890686:Jun  6 11:56:05 poc1 kernel: [ 4899.764985] host7: fcp:
00061e: xid 0ae1-0905: target abort cmd  passed
8890687:Jun  6 11:56:05 poc1 kernel: [ 4899.764990] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890688:Jun  6 11:56:05 poc1 kernel: [ 4899.764994] host7: xid  5a0:
f_ctl  90008 seq  2
8890689:Jun  6 11:56:05 poc1 kernel: [ 4899.765161] host7: xid  5a0:
exch: BLS rctl 84 - BLS accept
8890690:Jun  6 11:56:05 poc1 kernel: [ 4899.765210] host7: fcp:
00061e: xid 05a0-0aa4: target abort cmd  passed
8890691:Jun  6 11:56:05 poc1 kernel: [ 4899.765215] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890692:Jun  6 11:56:05 poc1 kernel: [ 4899.765219] host7: xid  e4c:
f_ctl  90008 seq  2
8890693:Jun  6 11:56:05 poc1 kernel: [ 4899.765424] host7: xid  e4c:
exch: BLS rctl 84 - BLS accept
8890694:Jun  6 11:56:05 poc1 kernel: [ 4899.765464] host7: fcp:
00061e: xid 0e4c-0be1: target abort cmd  passed
8890695:Jun  6 11:56:05 poc1 kernel: [ 4899.765469] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890696:Jun  6 11:56:05 poc1 kernel: [ 4899.765473] host7: xid  967:
f_ctl  90008 seq  2
8890697:Jun  6 11:56:05 poc1 kernel: [ 4899.765568] host7: xid  967:
exch: BLS rctl 84 - BLS accept
8890698:Jun  6 11:56:05 poc1 kernel: [ 4899.765618] host7: fcp:
00061e: xid 0967-0925: target abort cmd  passed
8890699:Jun  6 11:56:05 poc1 kernel: [ 4899.765624] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890700:Jun  6 11:56:05 poc1 kernel: [ 4899.765628] host7: xid  706:
f_ctl  90008 seq  2
8890701:Jun  6 11:56:05 poc1 kernel: [ 4899.765730] host7: xid  706:
exch: BLS rctl 84 - BLS accept
8890702:Jun  6 11:56:05 poc1 kernel: [ 4899.765782] host7: fcp:
00061e: xid 0706-0627: target abort cmd  passed
8890703:Jun  6 11:56:05 poc1 kernel: [ 4899.765787] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278978:Jun  6 11:57:27 poc1 kernel: [ 4981.894046] host7: xid  c26:
f_ctl  90008 seq  2
9278979:Jun  6 11:57:27 poc1 kernel: [ 4981.894288] host7: xid  c26:
exch: BLS rctl 84 - BLS accept
9278980:Jun  6 11:57:27 poc1 kernel: [ 4981.894335] host7: fcp:
00061e: xid 0c26-0926: target abort cmd  passed
9278981:Jun  6 11:57:27 poc1 kernel: [ 4981.894341] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278982:Jun  6 11:57:27 poc1 kernel: [ 4981.894346] host7: xid  b84:
f_ctl  90008 seq  2
9278983:Jun  6 11:57:27 poc1 kernel: [ 4981.894571] host7: xid  b84:
exch: BLS rctl 84 - BLS accept
9278984:Jun  6 11:57:27 poc1 kernel: [ 4981.894614] host7: fcp:
00061e: xid 0b84-0be8: target abort cmd  passed
9278985:Jun  6 11:57:27 poc1 kernel: [ 4981.894618] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278986:Jun  6 11:57:27 poc1 kernel: [ 4981.894622] host7: xid  9ad:
f_ctl  90008 seq  2
9278987:Jun  6 11:57:27 poc1 kernel: [ 4981.894875] host7: xid  9ad:
exch: BLS rctl 84 - BLS accept
9278988:Jun  6 11:57:27 poc1 kernel: [ 4981.894920] host7: fcp:
00061e: xid 09ad-0229: target abort cmd  passed
9278989:Jun  6 11:57:27 poc1 kernel: [ 4981.894925] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278990:Jun  6 11:57:27 poc1 kernel: [ 4981.894928] host7: xid  ae0:
f_ctl  90008 seq  2
9278991:Jun  6 11:57:27 poc1 kernel: [ 4981.895103] host7: xid  ae0:
exch: BLS rctl 84 - BLS accept
9278992:Jun  6 11:57:27 poc1 kernel: [ 4981.895152] host7: fcp:
00061e: xid 0ae0-04cc: target abort cmd  passed
9278993:Jun  6 11:57:27 poc1 kernel: [ 4981.895158] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278994:Jun  6 11:57:27 poc1 kernel: [ 4981.895163] host7: xid  d41:
f_ctl  90008 seq  2
9278995:Jun  6 11:57:27 poc1 kernel: [ 4981.895382] host7: xid  d41:
exch: BLS rctl 84 - BLS accept
9278996:Jun  6 11:57:27 poc1 kernel: [ 4981.895432] host7: fcp:
00061e: xid 0d41-0a2a: target abort cmd  passed
9278997:Jun  6 11:57:27 poc1 kernel: [ 4981.895438] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409501:Jun  6 12:14:54 poc1 kernel: [ 6029.674694] host7: xid  ee2:
f_ctl  90008 seq  2
38409502:Jun  6 12:14:54 poc1 kernel: [ 6029.674956] host7: xid  ee2:
exch: BLS rctl 84 - BLS accept
38409503:Jun  6 12:14:54 poc1 kernel: [ 6029.674999] host7: fcp:
00061e: xid 0ee2-034e: target abort cmd  passed
38409504:Jun  6 12:14:54 poc1 kernel: [ 6029.675005] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409505:Jun  6 12:14:54 poc1 kernel: [ 6029.675009] host7: xid  221:
f_ctl  90008 seq  2
38409506:Jun  6 12:14:54 poc1 kernel: [ 6029.675221] host7: xid  221:
exch: BLS rctl 84 - BLS accept
38409507:Jun  6 12:14:54 poc1 kernel: [ 6029.675268] host7: fcp:
00061e: xid 0221-09e2: target abort cmd  passed
38409508:Jun  6 12:14:54 poc1 kernel: [ 6029.675272] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409509:Jun  6 12:14:54 poc1 kernel: [ 6029.675276] host7: xid  ecd:
f_ctl  90008 seq  2
38409510:Jun  6 12:14:54 poc1 kernel: [ 6029.675523] host7: xid  ecd:
exch: BLS rctl 84 - BLS accept
38409511:Jun  6 12:14:54 poc1 kernel: [ 6029.675538] host7: fcp:
00061e: xid 0ecd-0964: target abort cmd  passed
38409512:Jun  6 12:14:54 poc1 kernel: [ 6029.675542] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409513:Jun  6 12:14:54 poc1 kernel: [ 6029.675546] host7: xid  f02:
f_ctl  90008 seq  2
38409514:Jun  6 12:14:54 poc1 kernel: [ 6029.675639] host7: xid  f02:
exch: BLS rctl 84 - BLS accept
38409515:Jun  6 12:14:54 poc1 kernel: [ 6029.675691] host7: fcp:
00061e: xid 0f02-0a8a: target abort cmd  passed
38409516:Jun  6 12:14:54 poc1 kernel: [ 6029.675697] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
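
To double-check the pairing mechanically, the second hex value in the
initiator's "xid xxxx-yyyy" abort lines can be compared against the xids
in the target's "0 msecs" lines. A sketch, run against each host's
/var/log/messages (or against copies of the two logs); the sed patterns
follow the message formats in the excerpts above:

# On the initiator (poc1): collect the remote xids from the abort messages.
grep "target abort cmd" /var/log/messages | \
  sed -n 's/.*xid 0*[0-9a-f]*-0*\([0-9a-f]*\):.*/\1/p' | sort -u > /tmp/initiator_xids
# On the target (poc2): collect the xids from the "0 msecs" messages.
grep "Exchange timer armed : 0 msecs" /var/log/messages | \
  sed -n 's/.*xid *0*\([0-9a-f]*\):.*/\1/p' | sort -u > /tmp/target_xids
# Identical lists confirm the one-to-one correspondence.
diff /tmp/initiator_xids /tmp/target_xids && echo "xids match 1:1"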

Nab,

We have applied your changes that print the timeout values:
On the target side:
40969:Jun  6 13:05:15 poc2 kernel: [  134.665057] fc_rport_create:
rdata->e_d_tov: 2000
40970:Jun  6 13:05:15 poc2 kernel: [  134.665060] fc_rport_create:
rdata->r_a_tov: 4000
40971:Jun  6 13:05:15 poc2 kernel: [  135.355969]
fc_rport_login_complete e_d_tov: 0
40972:Jun  6 13:05:15 poc2 kernel: [  135.355972] fc_rport_flogi_resp r_a_tov: 0
40973:Jun  6 13:05:15 poc2 kernel: [  135.355976]
fc_rport_enter_plogi: rdata->e_d_tov: 2000
40974:Jun  6 13:05:15 poc2 kernel: [  135.356206]
fc_rport_login_complete e_d_tov: 2000
40975:Jun  6 13:05:15 poc2 kernel: [  135.356973] fc_rport_work:
rpriv->e_d_tov: 2000
40976:Jun  6 13:05:15 poc2 kernel: [  135.356977] rpriv->r_a_tov:
rpriv->r_a_tov: 4000

On the initiator side:
950:Jun  6 13:05:15 poc1 kernel: [  574.652695] fc_rport_create:
rdata->e_d_tov: 2000
951:Jun  6 13:05:15 poc1 kernel: [  574.652698] fc_rport_create:
rdata->r_a_tov: 4000
952:Jun  6 13:05:15 poc1 kernel: [  575.344306] fc_rport_work:
rpriv->e_d_tov: 2000
953:Jun  6 13:05:15 poc1 kernel: [  575.344310] rpriv->r_a_tov:
rpriv->r_a_tov: 4000
1024:Jun  6 13:05:23 poc1 kernel: [  582.723206] fc_rport_work:
rpriv->e_d_tov: 2000
1025:Jun  6 13:05:23 poc1 kernel: [  582.723211] rpriv->r_a_tov:
rpriv->r_a_tov: 4000

When the aborts occur, the "iostat -x 1" command on the target side
starts to show 0 iops on some of the drives. Sometimes all 10 target
drives show 0 iops.
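
To catch that condition as it happens, the iostat output can be
filtered. A sketch, assuming the backing drives show up as sd* on the
target and that r/s and w/s are the 4th and 5th columns of this sysstat
version's extended output (adjust the column numbers if yours differ):

# Print a line whenever a drive reports zero reads and writes per second.
iostat -x 1 | awk '/^sd/ && ($4 + $5) == 0 { print $1, "0 iops"; fflush() }'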

What is a reasonable number to assume for the maximum number of
outstanding I/Os?

Thanks,

Jun


On Fri, Jun 6, 2014 at 1:28 PM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> On Thu, 2014-06-05 at 15:43 -0700, Nicholas A. Bellinger wrote:
>> On Wed, 2014-06-04 at 17:22 -0700, Jun Wu wrote:
>> > Is there design limit for the number of target drives that we should
>> > not cross? Is 10 a reasonable number? We did notice that lower number
>> > of target has less problems from our testing.
>> >
>>
>> It completely depends on the fabric, initiator, and backend storage.
>>
>> For example, some initiators (like ESX iSCSI on 1 Gb/sec ethernet) have
>> a problem with more than a handful of LUNs per session, that can result
>> in false positive timeouts on the initiator side under heavy I/O loads
>> due to fairness issues + I/Os not being sent out of the initiator fast
>> enough.
>>
>> Other initiators like Qlogic FC are able to run 256 LUNs in a single
>> session + endpoint without issues.
>>
>> > Are there any additional tests that we can do to narrow down the
>> > problem? For example try different IO types, random vs sequential,
>> > read vs write. Would that help?
>> >
>>
>> If the issue is really related to the number of outstanding I/Os on your
>> network, one easy thing to do is reduce the default queue_depth=3 to
>> queue_depth=1 for each LUN on the initiator side, and see if that has
>> any effect.
>>
>
> Good idea, and possibly that is the cause rather than something else,
> such as the switch being a factor without the DCB and PFC PAUSE
> typically used and required by FCoE.
>
>> I don't recall where these values are in /sys for fcoe, but are easy to
>
>
> It is at /sys/block/sdX/device/queue_depth; it is transport agnostic, and
> therefore the location is the same for fcoe as for any other disk.
>
>> find using 'find /sys -name queue_depth'.  Go ahead and set each of
>> these for your fcoe initiator's LUNs to queue_depth=1 + retest.
>>
>
>> Also note that these values are not persistent across restart.
>>
>> > Nab,
>> > We cannot change the connection between the servers. They are bare
>> > metal cloud servers that we don't have direct access.
>> >
>>
>> That's a shame, as it would certainly help isolate individual networking
>> components.
>>
>
> Yeah, that would have helped.
>
> //Vasu
>
>> --nab
>>
>
>



