Re: SQ overflow seen running isert traffic

Sagi Grimberg <sagi@xxxxxxxxxxx> · Tue, 21 Mar 2017 15:52:30 +0200

Hi Baharat and Nic,

Apologies for the late reply,

Hi Nicholas,
I see them from 2MB onwards.

   > Here is what I see with the 3 patches alone applied:
   >
   > -> In isert_put_datain() and isert_post_response() a corresponding recv
   > WR is posted before posting a send and hence for every post failure a
   > recv is already posted into a tightly packed RQ, causing it to overflow.

   Just for me to understand, the intermittent TMR ABORT_TASKs are caused
   by the repeated failure to post RDMA_WRITE WRs for a large ISER Data-In
   payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
   underlying hardware capabilities.
Yes.

   Moving the recv posts after the send post for RDMA_WRITEs helps to
   reduce the severity of the issue with iw_cxgb4, but doesn't completely
   eliminate the issue under load.
Moving the recv posts only comes into effect along with your changes.

...

   So the reason why your patch to swap post_recv -> post_send to post_send
   -> post_recv makes a difference is because it allows enough trickle of
   RDMA_WRITEs to make it through, where iser-initiator doesn't attempt to
   escalate recovery and doesn't attempt session reinstatement.
I don't know exactly whether the above comes into play, but the actual reason
I swapped the RQ and SQ posting is that, unlike the SQ, the RQ is filled to the
brim with WRs during initialisation itself. From then on we post one RQ WR for
every RQ completion, which keeps it almost full at any point in time.

Now in our scenario, the SQ is miscalculated and too small for a few adapters,
so it fills up gradually once the IO starts. Once the SQ is full, according to
your patches isert queues the command and tries to repost it. In iser functions
like isert_post_response() and isert_put_datain(), the post send is done after
the post recv. On the first post send failure in, say, isert_put_datain(), the
corresponding recv is already posted; then, when the command is queued and
reposted, an extra recv is posted again, which fills up the RQ as well.

By swapping post recv and post send as in my incremental patch, we don't post
that extra recv, and post the recv only on a successful post send.
Therefore I think this incremental patch is necessary.
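
The sequence described above can be sketched as a toy model of the two queue
counters. This is only an illustration under stated assumptions: every name
in it (toy_qp, toy_post, toy_respond, the TOY_* codes) is hypothetical and
none of them is an isert or verbs symbol, and the queue depths are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not an isert or verbs symbol. */
enum toy_status { TOY_OK, TOY_SQ_FULL, TOY_RQ_OVERFLOW };

struct toy_qp {
	int rq_depth, rq_posted;	/* receive queue capacity / WRs posted */
	int sq_depth, sq_posted;	/* send queue capacity / WRs posted */
};

/* Post one WR if the queue has room; returns false when it is full. */
static bool toy_post(int *posted, int depth)
{
	assert(*posted >= 0 && *posted <= depth);
	if (*posted == depth)
		return false;
	(*posted)++;
	return true;
}

/* One attempt at sending a response, in either posting order. */
static enum toy_status toy_respond(struct toy_qp *qp, bool recv_first)
{
	if (recv_first) {
		if (!toy_post(&qp->rq_posted, qp->rq_depth))
			return TOY_RQ_OVERFLOW;
		if (!toy_post(&qp->sq_posted, qp->sq_depth))
			return TOY_SQ_FULL;	/* the recv stays posted */
		return TOY_OK;
	}
	if (!toy_post(&qp->sq_posted, qp->sq_depth))
		return TOY_SQ_FULL;		/* nothing was posted on the RQ */
	if (!toy_post(&qp->rq_posted, qp->rq_depth))
		return TOY_RQ_OVERFLOW;
	return TOY_OK;
}
```

With an RQ of depth 4 holding 3 WRs and a full SQ, two recv-first attempts
return TOY_SQ_FULL and then TOY_RQ_OVERFLOW, while repeated send-first
attempts keep returning TOY_SQ_FULL and leave rq_posted at 3.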

Reversing the order of recv and send posting will cause problems
in stress IO workloads (especially for iWARP). The problem with sending
a reply before reposting the recv buffer is that the initiator can
immediately send a new request while we don't have a recv buffer waiting
for it, which will cause an RNR-NAK. This *will* cause performance drops
and jitter for sure.

How about we just track the rx_desc to know if we already posted it as
a start (untested as I don't have access to RDMA HW this week):
--

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 9b33c0c97468..fcbed35e95a8 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -817,6 +817,7 @@ isert_post_recvm(struct isert_conn *isert_conn, u32 count)
                rx_wr->sg_list = &rx_desc->rx_sg;
                rx_wr->num_sge = 1;
                rx_wr->next = rx_wr + 1;
+               rx_desc->in_use = false;
        }
        rx_wr--;
        rx_wr->next = NULL; /* mark end of work requests list */
@@ -835,6 +836,15 @@ isert_post_recv(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc)
        struct ib_recv_wr *rx_wr_failed, rx_wr;
        int ret;

+       if (!rx_desc->in_use) {
+               /*
+                * if the descriptor is not in-use we already reposted it
+                * for recv, so just silently return
+                */
+               return 0;
+       }
+
+       rx_desc->in_use = false;
        rx_wr.wr_cqe = &rx_desc->rx_cqe;
        rx_wr.sg_list = &rx_desc->rx_sg;
        rx_wr.num_sge = 1;
@@ -1397,6 +1407,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
                return;
        }

+       rx_desc->in_use = true;
+
        ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
                        ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);

diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index c02ada57d7f5..87d994de8c91 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -60,7 +60,7 @@

 #define ISER_RX_PAD_SIZE       (ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
                (ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct ib_sge) + \
-                sizeof(struct ib_cqe)))
+                sizeof(struct ib_cqe) + sizeof(bool)))

 #define ISCSI_ISER_SG_TABLESIZE                256

@@ -85,6 +85,7 @@ struct iser_rx_desc {
        u64             dma_addr;
        struct ib_sge   rx_sg;
        struct ib_cqe   rx_cqe;
+       bool            in_use;
        char            pad[ISER_RX_PAD_SIZE];
 } __packed;
--
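
For anyone who wants to reason about the in_use idea without RDMA HW, here is
a small user-space sketch of the same state machine. It is an assumption-laden
toy, not the driver code: toy_conn, toy_rx_desc, toy_recv_done and
toy_post_recv are hypothetical names, and the counter only stands in for the
real receive queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the connection and rx descriptor; hypothetical names. */
struct toy_conn {
	int rq_depth;	/* receive queue capacity */
	int rq_posted;	/* recv WRs currently posted */
};

struct toy_rx_desc {
	bool in_use;	/* true between recv completion and repost */
};

/* Models a recv completion: one WR is consumed, descriptor becomes in-use. */
static void toy_recv_done(struct toy_conn *conn, struct toy_rx_desc *desc)
{
	assert(conn->rq_posted > 0);
	conn->rq_posted--;
	desc->in_use = true;
}

/* Models the guarded repost: a second call for the same descriptor is a no-op. */
static int toy_post_recv(struct toy_conn *conn, struct toy_rx_desc *desc)
{
	if (!desc->in_use)
		return 0;		/* already reposted: silently succeed */
	desc->in_use = false;
	if (conn->rq_posted == conn->rq_depth)
		return -1;		/* stand-in for a failed post */
	conn->rq_posted++;
	return 0;
}
```

In this model, a completion followed by two toy_post_recv() calls on the same
descriptor leaves rq_posted where it started, so a retried response path can
keep the recv-before-send order without adding an extra recv WR.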

We have a lot of room for cleanups in isert... I'll need to
make some time to get it going...

I'll be waiting to hear from you if it makes your issue go away.

Cheers,
Sagi.