drivers/scsi/libiscsi.c: Fix iscsi_data_xmit design defect that causes a time-out

"胡海" <huhai@xxxxxxxxxx> · Wed, 16 May 2018 12:00:57 +0800

Problem description:
    Target:    a RAID1 array created from 8 disks
    Initiator:
    iSCSI negotiation result:
        ImmediateData=Yes
        InitialR2T=Yes
        FirstBurstLength=65536
        MaxBurstLength=262144
        MaxRecvDataSegmentLength=262144
        
        On the initiator we ran an iozone test (iozone -a -i 0 -i 2 -n 64g -g 64g  -y 64K -q 64K -f /mnt/iot) against it.
        When it reached the random tests, the target reported: "Unable to recover from DataOut timeout while in ERL=0",
        and the initiator reported: "connection1:0: detected conn error (1020)"
        

        
Problem analysis:

    Because FirstBurstLength=65536, each write in the 64k random write test
    fits within the first burst, so a large amount of data is sent to the
    target through the initiator's "conn->cmdqueue" queue.
    If requests are merged at this time (the probability of merging in random
    tests is very low), some data has to be sent to the target through the
    "conn->requeue" queue instead.
    Because so much data is going through "conn->cmdqueue", the
    "while (!list_empty(&conn->cmdqueue))" loop in iscsi_data_xmit stays true,
    so "conn->requeue" is not processed in time. The target does not receive
    the data it is waiting for, and times out.
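
    The starvation described above can be reproduced in a small user-space
    model. This is only a sketch, not kernel code: struct sim_conn and both
    sim_xmit_* functions are hypothetical names, the queues are reduced to
    counters, and an iteration budget stands in for a wall-clock deadline so
    that the result is deterministic.

```c
/* Hypothetical stand-in for the two libiscsi queues; only counts matter. */
struct sim_conn {
    unsigned int cmdqueue;  /* new commands waiting to be sent */
    unsigned int requeue;   /* tasks waiting to send Data-Out after an R2T */
};

/*
 * Run `ticks` transmit steps under a saturating load: on every step the
 * upper layer queues one more command, then one PDU is sent. cmdqueue is
 * always served first, as in the original loop order.
 * Returns the number of requeue entries that were serviced.
 */
unsigned int sim_xmit_strict_priority(struct sim_conn *c, unsigned int ticks)
{
    unsigned int requeue_sent = 0;

    for (unsigned int t = 0; t < ticks; t++) {
        c->cmdqueue++;                  /* producer keeps the queue non-empty */
        if (c->cmdqueue > 0) {
            c->cmdqueue--;
        } else if (c->requeue > 0) {    /* never reached under this load */
            c->requeue--;
            requeue_sent++;
        }
    }
    return requeue_sent;
}

/*
 * Same load, but cmdqueue may be served at most `budget` times in a row
 * while requeue has work pending. The budget is an assumption of this
 * sketch and plays the role of a deadline on the cmdqueue loop.
 */
unsigned int sim_xmit_bounded(struct sim_conn *c, unsigned int ticks,
                              unsigned int budget)
{
    unsigned int requeue_sent = 0;
    unsigned int run = 0;               /* consecutive cmdqueue sends */

    for (unsigned int t = 0; t < ticks; t++) {
        c->cmdqueue++;
        if (c->cmdqueue > 0 && (run < budget || c->requeue == 0)) {
            c->cmdqueue--;
            run++;
        } else if (c->requeue > 0) {
            c->requeue--;
            requeue_sent++;
            run = 0;                    /* cmdqueue gets a fresh budget */
        }
    }
    return requeue_sent;
}
```

    With three entries on requeue and 1000 steps, the strict-priority
    variant services none of them, while the bounded variant with a budget
    of 10 services all three.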
    
    
    We also enabled kernel debug output and extended the timeout; dmesg then shows the following (64k random write test in which a merge occurred):
    
                17:34:41 kylinOS kernel: Got SCSI Command, ITT: 0x00000015, CmdSN: 0x8f500e00, ExpXferLen: 131072, Length: 65536, CID: 0
                ..............
                17:34:41 kylinOS kernel: Built R2T, ITT: 0x00000015, TTT: 0x000ead3e, StatSN: 0x111c0410, R2TSN: 0x00000000, Offset: 65536, DDTL: 65536, CID: 0
                17:34:41 kylinOS kernel: ret: 48, sent data: 48
                17:34:41 kylinOS kernel: Starting DataOUT timer for ITT: 0x00000015 on CID: 0, timeout:203.
                17:34:41 kylinOS kernel: Updated MaxCmdSN to 0x000e50c4
                ...........
                17:34:47 kylinOS kernel: Got DataOut ITT: 0x00000015, TTT: 0x3ead0e00, DataSN: 0x00000000, Offset: 65536, Length: 65536, CID: 0
                17:34:47 kylinOS kernel: Updated DataOUT timer for ITT: 0x00000015, timeout:203.
                ..........
                17:34:47 kylinOS kernel: Stopped DataOUT Timer for ITT: 0x00000015

    As you can see, the merged request (128K), "ITT: 0x00000015", took 6s to complete.

    So I think iscsi_data_xmit needs to be modified to reduce I/O delay and avoid unnecessary time-outs.
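
    The lengths in the log are consistent with the negotiated parameters.
    The sketch below works through the arithmetic. It assumes
    ImmediateData=Yes and InitialR2T=Yes as negotiated above, and it assumes
    the listed MaxRecvDataSegmentLength is the target's; struct write_split
    and split_write are hypothetical names used only for illustration.

```c
#include <stdint.h>

struct write_split {
    uint32_t immediate;  /* bytes carried with the SCSI Command PDU */
    uint32_t solicited;  /* bytes that must wait for an R2T */
    uint32_t r2t_count;  /* minimum number of R2Ts needed for them */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Split a write of xfer_len bytes into immediate data and R2T-solicited
 * data. Immediate data is capped by FirstBurstLength and by the receiver's
 * MaxRecvDataSegmentLength; each R2T may request at most MaxBurstLength
 * bytes, so r2t_count is a lower bound (a target may use smaller R2Ts).
 * max_burst must be non-zero.
 */
struct write_split split_write(uint32_t xfer_len, uint32_t first_burst,
                               uint32_t max_burst, uint32_t max_recv_dsl)
{
    struct write_split s;

    s.immediate = min_u32(xfer_len, min_u32(first_burst, max_recv_dsl));
    s.solicited = xfer_len - s.immediate;
    s.r2t_count = (s.solicited + max_burst - 1) / max_burst;
    return s;
}
```

    For the merged request, split_write(131072, 65536, 262144, 262144) gives
    65536 immediate bytes and 65536 solicited bytes in one R2T, matching
    "ExpXferLen: 131072, Length: 65536" and the R2T with "Offset: 65536"
    in the log. A plain 64k write gives 65536 immediate bytes and no R2T,
    so it never touches "conn->requeue".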



Proposed solution:
    Add a timeout mechanism to the while loops:

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index c051694..77108fe 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1492,6 +1492,8 @@ EXPORT_SYMBOL_GPL(iscsi_requeue_task);
 static int iscsi_data_xmit(struct iscsi_conn *conn)
 {
        struct iscsi_task *task;
+       unsigned long cmd_timeout = 0;
+       unsigned long req_timeout = 0;
        int rc = 0;
 
        spin_lock_bh(&conn->session->frwd_lock);
@@ -1530,8 +1532,11 @@ check_mgmt:
                        goto done;
        }
 
+check_cmd:
        /* process pending command queue */
-       while (!list_empty(&conn->cmdqueue)) {
+       if (!cmd_timeout)
+               cmd_timeout = jiffies + HZ/2; /* timeout in 0.5s */
+       while (!list_empty(&conn->cmdqueue) && time_before(jiffies, cmd_timeout)) {
                conn->task = list_entry(conn->cmdqueue.next, struct iscsi_task,
                                        running);
                list_del_init(&conn->task->running);
@@ -1562,7 +1567,10 @@ check_mgmt:
                        goto check_mgmt;
        }
 
-       while (!list_empty(&conn->requeue)) {
+check_req:
+       if (!req_timeout)
+               req_timeout = jiffies + HZ/2; /* timeout in 0.5s */
+       while (!list_empty(&conn->requeue) && time_before(jiffies, req_timeout)) {
                /*
                 * we always do fastlogout - conn stop code will clean up.
                 */
@@ -1583,6 +1591,15 @@ check_mgmt:
                if (!list_empty(&conn->mgmtqueue))
                        goto check_mgmt;
        }
+
+       /* Check whether there is still data to send */
+       cmd_timeout = 0;
+       req_timeout = 0;
+       if (!list_empty(&conn->cmdqueue))
+               goto check_cmd;
+       if (!list_empty(&conn->requeue))
+               goto check_req;
+
        spin_unlock_bh(&conn->session->frwd_lock);
        return -ENODATA;


Do you have any thoughts on this problem?
Looking forward to your reply.
Thank you!