Re: SQ overflow seen running isert traffic

Potnuri Bharat Teja <bharat@xxxxxxxxxxx> · Mon, 20 Mar 2017 18:35:47 +0530

On Thursday, October 10/20/16, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
>    Hey Jason,
> 
>    >> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
>    seeing SQ
>    >> overflows.  So the SQ sizing needs more investigation.
>    >
>    > NFS had this sort of problem - in that case it was because the code
>    > was assuming that a RQ completion implied SQ space - that is not
>    > legal, only direct completions from SQ WCs can guide available space
>    > in the SQ..
> 
>    Its not the same problem. iser-target does not tie SQ and RQ spaces.
>    The origin here is the difference between IB/RoCE and iWARP and the
>    chelsio HW that makes it hard to predict the SQ correct size.
> 
>    iWARP needs extra registration for rdma reads and the chelsio device
>    seems to be limited in the number of pages per registration so this
>    configuration will need a larger send queue.
> 
>    Another problem is that we don't have a correct retry flow.
> 
>    Hopefully we can address that in the RW API which is designed to hide
>    these details from the ULP...
Hi Sagi,
Here is what our further analysis of SQ dump at the time of overflow says:

RDMA read/write API is creating long chains (32 WRs) to handle large ISCSI 
READs. For Writing iscsi default block size of 512KB data, iw_cxgb4's max 
number of sge advertised is 4 page ~ 16KB for write, needs WR chain of 32 WRs 
(another possible factor is they all are unsignalled WRs and are completed
only after next signalled WR) But apparantly rdma_rw_init_qp() assumes that 
any given IO will take only 1 WRITE WR to convey the data.

This evidently is incorrect and rdma_rw_init_qp() needs to factor and size 
the queue based on max_sge of device for write and read and the sg_tablesize 
for which rdma read/write is used for, like ISCSI_ISER_MAX_SG_TABLESIZE of 
initiator. If above analysis is correct, please suggest how could this be fixed? 

Further, using MRs for rdma WRITE by using rdma_wr_force_mr = 1 module 
parameter of ib_core avoids SQ overflow by registering a single REG_MR and 
using that MR for a single WRITE WR. So a rdma-rw IO chain of say 32 WRITE 
WRs, becomes just 3 WRS:  REG_MR + WRITE + INV_MR as 
max_fast_reg_page_list_len of iw_cxgb4 is 128 page.

(By default force_mr is not set and iw_cxgb4 could only use MR for rdma 
READs only as per rdma_rw_io_needs_mr() if force_mr isnt set) 
>From this is there any possibility that we could use MR if the write WR
chain exceeds a certain number?

Thanks for your time!

-Bharat.
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html