Problem description:

Target: a RAID1 array built from 8 disks.

Initiator: the iSCSI negotiation result is:

    ImmediateData=Yes
    InitialR2T=Yes
    FirstBurstLength=65536
    MaxBurstLength=262144
    MaxRecvDataSegmentLength=262144

On the initiator we run an iozone test against the target:

    iozone -a -i 0 -i 2 -n 64g -g 64g -y 64K -q 64K -f /mnt/iot

When the run reaches the random tests, the target reports:

    "Unable to recover from DataOut timeout while in ERL=0"

and the initiator reports:

    "connection1:0: detected conn error (1020)"

Problem analysis:

Because FirstBurstLength=65536, a 64K random write fits entirely in the
first burst, so a large amount of data is sent to the target through the
initiator's "conn->cmdqueue" queue. If requests are merged at this point
(the probability of a merge in the random tests is very low), the merged
request is larger than FirstBurstLength, and the remainder of its data has
to be sent to the target through the "conn->requeue" queue after the target
issues an R2T.

Because so much data is flowing through "conn->cmdqueue", the design of
iscsi_data_xmit() means that "while (!list_empty(&conn->cmdqueue))" stays
true for a long time. As a result, "conn->requeue" is not processed in
time, the target does not receive the Data-Out it is waiting for, and its
DataOut timer expires.

If we enable kernel debug output and extend the timeout, dmesg shows the
following (64K random write test in which a merge occurs):

17:34:41 kylinOS kernel: Got SCSI Command, ITT: 0x00000015, CmdSN: 0x8f500e00, ExpXferLen: 131072, Length: 65536, CID: 0
..............
17:34:41 kylinOS kernel: Built R2T, ITT: 0x00000015, TTT: 0x000ead3e, StatSN: 0x111c0410, R2TSN: 0x00000000, Offset: 65536, DDTL: 65536, CID: 0
17:34:41 kylinOS kernel: ret: 48, sent data: 48
17:34:41 kylinOS kernel: Starting DataOUT timer for ITT: 0x00000015 on CID: 0, timeout:203.
17:34:41 kylinOS kernel: Updated MaxCmdSN to 0x000e50c4
...........
17:34:47 kylinOS kernel: Got DataOut ITT: 0x00000015, TTT: 0x3ead0e00, DataSN: 0x00000000, Offset: 65536, Length: 65536, CID: 0
17:34:47 kylinOS kernel: Updated DataOUT timer for ITT: 0x00000015, timeout:203.
..........
17:34:47 kylinOS kernel: Stopped DataOUT Timer for ITT: 0x00000015

As you can see, the merged request (128K), "ITT: 0x00000015", took 6
seconds to complete: the R2T was built at 17:34:41, but the matching
Data-Out only arrived at 17:34:47.

So I think iscsi_data_xmit() needs to be modified to reduce I/O delay and
avoid unnecessary timeouts.

Problem solution:

We should add a timeout mechanism to the while loops:

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index c051694..77108fe 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1492,6 +1492,8 @@ EXPORT_SYMBOL_GPL(iscsi_requeue_task);
 static int iscsi_data_xmit(struct iscsi_conn *conn)
 {
 	struct iscsi_task *task;
+	unsigned long cmd_timeout = 0;
+	unsigned long req_timeout = 0;
 	int rc = 0;
 
 	spin_lock_bh(&conn->session->frwd_lock);
@@ -1530,8 +1532,11 @@ check_mgmt:
 			goto done;
 	}
 
+check_cmd:
 	/* process pending command queue */
-	while (!list_empty(&conn->cmdqueue)) {
+	if (!cmd_timeout)
+		cmd_timeout = jiffies + HZ/2; /* timeout in 0.5s */
+	while (!list_empty(&conn->cmdqueue) && time_before(jiffies, cmd_timeout)) {
 		conn->task = list_entry(conn->cmdqueue.next, struct iscsi_task,
 					running);
 		list_del_init(&conn->task->running);
@@ -1562,7 +1567,10 @@ check_mgmt:
 			goto check_mgmt;
 	}
 
-	while (!list_empty(&conn->requeue)) {
+check_req:
+	if (!req_timeout)
+		req_timeout = jiffies + HZ/2; /* timeout in 0.5s */
+	while (!list_empty(&conn->requeue) && time_before(jiffies, req_timeout)) {
 		/*
 		 * we always do fastlogout - conn stop code will clean up.
 		 */
@@ -1583,6 +1591,15 @@ check_mgmt:
 		if (!list_empty(&conn->mgmtqueue))
 			goto check_mgmt;
 	}
+
+	/* Check whether there is still data to send */
+	cmd_timeout = 0;
+	req_timeout = 0;
+	if (!list_empty(&conn->cmdqueue))
+		goto check_cmd;
+	if (!list_empty(&conn->requeue))
+		goto check_req;
+
 	spin_unlock_bh(&conn->session->frwd_lock);
 	return -ENODATA;

Do you have any thoughts on this problem? Looking forward to your reply.

Thank you!
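
P.S. The starvation described in the analysis can be illustrated with a
small user-space model. This is only a sketch under simplifying
assumptions, not kernel code: the two queues are plain counters, one new
command arrives for every command sent, and the budget is counted in sent
commands as a stand-in for the jiffies deadline used in the patch. The
function name and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical model of the two transmit phases in iscsi_data_xmit().
 *
 * incoming_cmds: number of new commands that will still arrive; one
 *                arrives each time a command is sent (sustained load).
 * budget:        maximum commands sent before moving on to the requeue
 *                phase; 0 or less means "no limit" (original behaviour).
 *
 * Returns the number of transmit steps taken until the single pending
 * requeue entry (the Data-Out for an R2T) has been sent.
 */
int steps_until_requeue_served(int incoming_cmds, int budget)
{
	int cmdqueue = 1;	/* stands in for conn->cmdqueue */
	int requeue = 1;	/* stands in for conn->requeue */
	int steps = 0;
	int sent = 0;

	/* Command phase: runs while commands are pending and budget allows. */
	while (cmdqueue > 0 && (budget <= 0 || sent < budget)) {
		cmdqueue--;
		sent++;
		steps++;
		if (incoming_cmds > 0) {
			/* A new command is queued while we were sending. */
			incoming_cmds--;
			cmdqueue++;
		}
	}

	/* Requeue phase: only reached once the command phase exits. */
	if (requeue > 0) {
		requeue--;
		steps++;
	}

	return steps;
}
```

In this model, steps_until_requeue_served(1000, 0) returns 1002, because
the requeue entry waits behind every incoming command, while
steps_until_requeue_served(1000, 8) returns 9, because the command phase
yields after 8 commands.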