The queue_depth is 32 by default, so all of my previous tests were run with queue_depth=32. The results of today's tests confirm that a higher queue_depth produces more aborts on the initiator side and correspondingly more "Exchange timer armed : 0 msecs" messages on the target side. I tried queue_depth = 1, 2, 3, and 6. With 1 and 2 there are no aborts and no abnormal messages. With 3 there are 15 aborts and 15 "0 msecs" messages. With queue_depth increased to 6, there are 81 aborts and 81 "0 msecs" messages for the same fio test.
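A per-LUN depth change like this goes through the /sys/block/sdX/device/queue_depth attribute that Vasu points at further down in the quoted thread. As a sketch of that step (the sd* glob is a placeholder and should be narrowed to the actual FCoE LUNs, and the setting does not survive a reboot):

  # Set the queue depth on each FCoE LUN on the initiator, then verify.
  # Placeholder glob; list the real devices, e.g. via 'find /sys -name queue_depth'.
  for q in /sys/block/sd*/device/queue_depth; do
      echo 6 > "$q"      # 1, 2, 3 or 6 for the runs above
  done
  grep . /sys/block/sd*/device/queue_depth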
There appears to be a one-to-one correspondence between each abort message on the initiator side and a "0 msecs" message on the target side; the xids are the same.

On the target side:

[root@poc2 log]# grep "0 msec" messages
Jun 6 11:56:05 poc2 kernel: [ 4436.866024] host7: xid 88b: Exchange timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866263] host7: xid 905: Exchange timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866563] host7: xid aa4: Exchange timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866740] host7: xid be1: Exchange timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866972] host7: xid 925: Exchange timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.867137] host7: xid 627: Exchange timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.995622] host7: xid 926: Exchange timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.995919] host7: xid be8: Exchange timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.996231] host7: xid 229: Exchange timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.996500] host7: xid 4cc: Exchange timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.996763] host7: xid a2a: Exchange timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.776388] host7: xid 34e: Exchange timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.776657] host7: xid 9e2: Exchange timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.776950] host7: xid 964: Exchange timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.777172] host7: xid a8a: Exchange timer armed : 0 msecs

On the initiator side:

8890680:Jun 6 11:56:05 poc1 kernel: [ 4899.764511] host7: xid 745: f_ctl 90008 seq 2
8890681:Jun 6 11:56:05 poc1 kernel: [ 4899.764711] host7: xid 745: exch: BLS rctl 84 - BLS accept
8890682:Jun 6 11:56:05 poc1 kernel: [ 4899.764753] host7: fcp: 00061e: xid 0745-088b: target abort cmd passed
8890683:Jun 6 11:56:05 poc1 kernel: [ 4899.764758] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890684:Jun 6 11:56:05 poc1 kernel: [ 4899.764762] host7: xid ae1: f_ctl 90008 seq 2
8890685:Jun 6 11:56:05 poc1 kernel: [ 4899.764927] host7: xid ae1: exch: BLS rctl 84 - BLS accept
8890686:Jun 6 11:56:05 poc1 kernel: [ 4899.764985] host7: fcp: 00061e: xid 0ae1-0905: target abort cmd passed
8890687:Jun 6 11:56:05 poc1 kernel: [ 4899.764990] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890688:Jun 6 11:56:05 poc1 kernel: [ 4899.764994] host7: xid 5a0: f_ctl 90008 seq 2
8890689:Jun 6 11:56:05 poc1 kernel: [ 4899.765161] host7: xid 5a0: exch: BLS rctl 84 - BLS accept
8890690:Jun 6 11:56:05 poc1 kernel: [ 4899.765210] host7: fcp: 00061e: xid 05a0-0aa4: target abort cmd passed
8890691:Jun 6 11:56:05 poc1 kernel: [ 4899.765215] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890692:Jun 6 11:56:05 poc1 kernel: [ 4899.765219] host7: xid e4c: f_ctl 90008 seq 2
8890693:Jun 6 11:56:05 poc1 kernel: [ 4899.765424] host7: xid e4c: exch: BLS rctl 84 - BLS accept
8890694:Jun 6 11:56:05 poc1 kernel: [ 4899.765464] host7: fcp: 00061e: xid 0e4c-0be1: target abort cmd passed
8890695:Jun 6 11:56:05 poc1 kernel: [ 4899.765469] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890696:Jun 6 11:56:05 poc1 kernel: [ 4899.765473] host7: xid 967: f_ctl 90008 seq 2
8890697:Jun 6 11:56:05 poc1 kernel: [ 4899.765568] host7: xid 967: exch: BLS rctl 84 - BLS accept
8890698:Jun 6 11:56:05 poc1 kernel: [ 4899.765618] host7: fcp: 00061e: xid 0967-0925: target abort cmd passed
8890699:Jun 6 11:56:05 poc1 kernel: [ 4899.765624] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890700:Jun 6 11:56:05 poc1 kernel: [ 4899.765628] host7: xid 706: f_ctl 90008 seq 2
8890701:Jun 6 11:56:05 poc1 kernel: [ 4899.765730] host7: xid 706: exch: BLS rctl 84 - BLS accept
8890702:Jun 6 11:56:05 poc1 kernel: [ 4899.765782] host7: fcp: 00061e: xid 0706-0627: target abort cmd passed
8890703:Jun 6 11:56:05 poc1 kernel: [ 4899.765787] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278978:Jun 6 11:57:27 poc1 kernel: [ 4981.894046] host7: xid c26: f_ctl 90008 seq 2
9278979:Jun 6 11:57:27 poc1 kernel: [ 4981.894288] host7: xid c26: exch: BLS rctl 84 - BLS accept
9278980:Jun 6 11:57:27 poc1 kernel: [ 4981.894335] host7: fcp: 00061e: xid 0c26-0926: target abort cmd passed
9278981:Jun 6 11:57:27 poc1 kernel: [ 4981.894341] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278982:Jun 6 11:57:27 poc1 kernel: [ 4981.894346] host7: xid b84: f_ctl 90008 seq 2
9278983:Jun 6 11:57:27 poc1 kernel: [ 4981.894571] host7: xid b84: exch: BLS rctl 84 - BLS accept
9278984:Jun 6 11:57:27 poc1 kernel: [ 4981.894614] host7: fcp: 00061e: xid 0b84-0be8: target abort cmd passed
9278985:Jun 6 11:57:27 poc1 kernel: [ 4981.894618] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278986:Jun 6 11:57:27 poc1 kernel: [ 4981.894622] host7: xid 9ad: f_ctl 90008 seq 2
9278987:Jun 6 11:57:27 poc1 kernel: [ 4981.894875] host7: xid 9ad: exch: BLS rctl 84 - BLS accept
9278988:Jun 6 11:57:27 poc1 kernel: [ 4981.894920] host7: fcp: 00061e: xid 09ad-0229: target abort cmd passed
9278989:Jun 6 11:57:27 poc1 kernel: [ 4981.894925] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278990:Jun 6 11:57:27 poc1 kernel: [ 4981.894928] host7: xid ae0: f_ctl 90008 seq 2
9278991:Jun 6 11:57:27 poc1 kernel: [ 4981.895103] host7: xid ae0: exch: BLS rctl 84 - BLS accept
9278992:Jun 6 11:57:27 poc1 kernel: [ 4981.895152] host7: fcp: 00061e: xid 0ae0-04cc: target abort cmd passed
9278993:Jun 6 11:57:27 poc1 kernel: [ 4981.895158] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278994:Jun 6 11:57:27 poc1 kernel: [ 4981.895163] host7: xid d41: f_ctl 90008 seq 2
9278995:Jun 6 11:57:27 poc1 kernel: [ 4981.895382] host7: xid d41: exch: BLS rctl 84 - BLS accept
9278996:Jun 6 11:57:27 poc1 kernel: [ 4981.895432] host7: fcp: 00061e: xid 0d41-0a2a: target abort cmd passed
9278997:Jun 6 11:57:27 poc1 kernel: [ 4981.895438] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409501:Jun 6 12:14:54 poc1 kernel: [ 6029.674694] host7: xid ee2: f_ctl 90008 seq 2
38409502:Jun 6 12:14:54 poc1 kernel: [ 6029.674956] host7: xid ee2: exch: BLS rctl 84 - BLS accept
38409503:Jun 6 12:14:54 poc1 kernel: [ 6029.674999] host7: fcp: 00061e: xid 0ee2-034e: target abort cmd passed
38409504:Jun 6 12:14:54 poc1 kernel: [ 6029.675005] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409505:Jun 6 12:14:54 poc1 kernel: [ 6029.675009] host7: xid 221: f_ctl 90008 seq 2
38409506:Jun 6 12:14:54 poc1 kernel: [ 6029.675221] host7: xid 221: exch: BLS rctl 84 - BLS accept
38409507:Jun 6 12:14:54 poc1 kernel: [ 6029.675268] host7: fcp: 00061e: xid 0221-09e2: target abort cmd passed
38409508:Jun 6 12:14:54 poc1 kernel: [ 6029.675272] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409509:Jun 6 12:14:54 poc1 kernel: [ 6029.675276] host7: xid ecd: f_ctl 90008 seq 2
38409510:Jun 6 12:14:54 poc1 kernel: [ 6029.675523] host7: xid ecd: exch: BLS rctl 84 - BLS accept
38409511:Jun 6 12:14:54 poc1 kernel: [ 6029.675538] host7: fcp: 00061e: xid 0ecd-0964: target abort cmd passed
38409512:Jun 6 12:14:54 poc1 kernel: [ 6029.675542] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409513:Jun 6 12:14:54 poc1 kernel: [ 6029.675546] host7: xid f02: f_ctl 90008 seq 2
38409514:Jun 6 12:14:54 poc1 kernel: [ 6029.675639] host7: xid f02: exch: BLS rctl 84 - BLS accept
38409515:Jun 6 12:14:54 poc1 kernel: [ 6029.675691] host7: fcp: 00061e: xid 0f02-0a8a: target abort cmd passed
38409516:Jun 6 12:14:54 poc1 kernel: [ 6029.675697] host7: fcp: 00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
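The pairing can be checked mechanically: the second xid in each initiator "target abort cmd passed" line is the target-side exchange ID of a matching "0 msecs" line, modulo the leading zero. A rough sketch, assuming the logs are in /var/log/messages on both hosts as in the greps above:

  # On the initiator (poc1): the xid after the dash, leading zeros stripped.
  grep -o 'xid [0-9a-f]*-[0-9a-f]*: target abort cmd passed' /var/log/messages |
      sed 's/.*-0*\([0-9a-f]*\):.*/\1/' | sort > /tmp/abort_xids

  # On the target (poc2): the xid from each "0 msecs" line, leading zeros stripped.
  grep 'Exchange timer armed : 0 msecs' /var/log/messages |
      sed 's/.*xid 0*\([0-9a-f]*\):.*/\1/' | sort > /tmp/zero_msec_xids

  # Copy one of the two lists to the other host; identical files (and equal
  # line counts) confirm the one-to-one correspondence.
  diff /tmp/abort_xids /tmp/zero_msec_xids && wc -l /tmp/abort_xids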
Nab,

We have included your changes that print the timeout values.

On the target side:

40969:Jun 6 13:05:15 poc2 kernel: [ 134.665057] fc_rport_create: rdata->e_d_tov: 2000
40970:Jun 6 13:05:15 poc2 kernel: [ 134.665060] fc_rport_create: rdata->r_a_tov: 4000
40971:Jun 6 13:05:15 poc2 kernel: [ 135.355969] fc_rport_login_complete e_d_tov: 0
40972:Jun 6 13:05:15 poc2 kernel: [ 135.355972] fc_rport_flogi_resp r_a_tov: 0
40973:Jun 6 13:05:15 poc2 kernel: [ 135.355976] fc_rport_enter_plogi: rdata->e_d_tov: 2000
40974:Jun 6 13:05:15 poc2 kernel: [ 135.356206] fc_rport_login_complete e_d_tov: 2000
40975:Jun 6 13:05:15 poc2 kernel: [ 135.356973] fc_rport_work: rpriv->e_d_tov: 2000
40976:Jun 6 13:05:15 poc2 kernel: [ 135.356977] rpriv->r_a_tov: rpriv->r_a_tov: 4000

On the initiator side:

950:Jun 6 13:05:15 poc1 kernel: [ 574.652695] fc_rport_create: rdata->e_d_tov: 2000
951:Jun 6 13:05:15 poc1 kernel: [ 574.652698] fc_rport_create: rdata->r_a_tov: 4000
952:Jun 6 13:05:15 poc1 kernel: [ 575.344306] fc_rport_work: rpriv->e_d_tov: 2000
953:Jun 6 13:05:15 poc1 kernel: [ 575.344310] rpriv->r_a_tov: rpriv->r_a_tov: 4000
1024:Jun 6 13:05:23 poc1 kernel: [ 582.723206] fc_rport_work: rpriv->e_d_tov: 2000
1025:Jun 6 13:05:23 poc1 kernel: [ 582.723211] rpriv->r_a_tov: rpriv->r_a_tov: 4000

When the aborts occur, "iostat -x 1" on the target side starts to show 0 iops on some of the drives; sometimes all 10 target drives show 0 iops. What is a reasonable number to assume for the maximum number of outstanding IO?

Thanks,
Jun
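For a rough upper bound, the initiator can have about the per-LUN queue_depth summed across its LUNs in flight at once, further capped by the HBA's can_queue; with the 10 LUNs here at the default queue_depth of 32 that is on the order of 320 commands. A sketch for reading the configured total from sysfs, again with a placeholder sd* glob:

  # Sum the per-LUN queue_depth over the initiator's SCSI disks;
  # e.g. 10 LUNs at queue_depth 32 gives roughly 320 outstanding commands.
  awk '{ total += $1 } END { print total }' /sys/block/sd*/device/queue_depth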
On Fri, Jun 6, 2014 at 1:28 PM, Vasu Dev <vasu.dev@xxxxxxxxxxxxxxx> wrote:
> On Thu, 2014-06-05 at 15:43 -0700, Nicholas A. Bellinger wrote:
>> On Wed, 2014-06-04 at 17:22 -0700, Jun Wu wrote:
>> > Is there design limit for the number of target drives that we should
>> > not cross? Is 10 a reasonable number? We did notice that lower number
>> > of target has less problems from our testing.
>> >
>>
>> It completely depends on the fabric, initiator, and backend storage.
>>
>> For example, some initiators (like ESX iSCSI on 1 Gb/sec ethernet) have
>> a problem with more than a handful of LUNs per session, that can result
>> in false positive timeouts on the initiator side under heavy I/O loads
>> due to fairness issues + I/Os not being sent out of the initiator fast
>> enough.
>>
>> Other initiators like Qlogic FC are able to run 256 LUNs in a single
>> session + endpoint without issues.
>>
>> > Are there any additional tests that we can do to narrow down the
>> > problem? For example try different IO types, random vs sequential,
>> > read vs write. Would that help?
>> >
>>
>> If the issue is really related to the number of outstanding I/Os on your
>> network, one easy thing to do is reduce the default queue_depth=3 to
>> queue_depth=1 for each LUN on the initiator side, and see if that has
>> any effect.
>>
>
> Good idea and possibly that is the cause and not something else such as
> switch is the factor w/o DCB and PFC PAUSE typically used and required
> by fcoe.
>
>> I don't recall where these values are in /sys for fcoe, but are easy to
>
>
> It is at /sys/block/sdX/device/queue_depth, it is transport agnostic and
> therefore same location for fcoe also like any other disks are.
>
>> find using 'find /sys -name queue_depth'. Go ahead and set each of
>> these for your fcoe initiators LUNs to queue_depth=1 + retest.
>>
>
>> Also note that these values are not persistent across restart.
>>
>> > Nab,
>> > We cannot change the connection between the servers. They are bare
>> > metal cloud servers that we don't have direct access.
>> >
>>
>> That's a shame, as it would certainly help isolate individual networking
>> components.
>>
>
> Yeah, that would have helped.
>
> //Vasu
>
>> --nab
>>
>
>
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html