+Christoph

>
> On Thursday, October 20, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> > Hey Jason,
> >
> > >> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> > >> seeing SQ overflows. So the SQ sizing needs more investigation.
> > >
> > > NFS had this sort of problem - in that case it was because the code
> > > was assuming that a RQ completion implied SQ space - that is not
> > > legal, only direct completions from SQ WCs can guide available space
> > > in the SQ..
> >
> > It's not the same problem. iser-target does not tie SQ and RQ spaces.
> > The origin here is the difference between IB/RoCE and iWARP and the
> > chelsio HW that makes it hard to predict the correct SQ size.
> >
> > iWARP needs extra registration for rdma reads and the chelsio device
> > seems to be limited in the number of pages per registration so this
> > configuration will need a larger send queue.
> >
> > Another problem is that we don't have a correct retry flow.
> >
> > Hopefully we can address that in the RW API which is designed to hide
> > these details from the ULP...
> Hi Sagi,
> Here is what our further analysis of the SQ dump at the time of overflow says:
>
> The RDMA read/write API is creating long chains (32 WRs) to handle large
> ISCSI READs. For writing the iscsi default block size of 512KB of data,
> iw_cxgb4's advertised max number of sges is 4 pages ~ 16KB for write, so it
> needs a WR chain of 32 WRs (another possible factor is that they are all
> unsignalled WRs and are completed only after the next signalled WR). But
> apparently rdma_rw_init_qp() assumes that any given IO will take only
> 1 WRITE WR to convey the data.
>
> This evidently is incorrect, and rdma_rw_init_qp() needs to size the queue
> based on the device's max_sge for write and read and on the sg_tablesize
> for which rdma read/write is used, like ISCSI_ISER_MAX_SG_TABLESIZE of the
> initiator. If the above analysis is correct, please suggest how this could
> be fixed?
>
> Further, using MRs for rdma WRITE by setting the rdma_rw_force_mr = 1 module
> parameter of ib_core avoids the SQ overflow by registering a single REG_MR
> and using that MR for a single WRITE WR. So an rdma-rw IO chain of, say,
> 32 WRITE WRs becomes just 3 WRs: REG_MR + WRITE + INV_MR, as
> max_fast_reg_page_list_len of iw_cxgb4 is 128 pages.
>
> (By default force_mr is not set, and iw_cxgb4 uses MRs only for rdma
> READs, as per rdma_rw_io_needs_mr().)
> From this, is there any possibility that we could use an MR if the write WR
> chain exceeds a certain number?
>
> Thanks for your time!
>

I think it is time to resolve this XXX comment in rw.c for
rdma_rw_io_needs_mr():

/*
 * Check if the device will use memory registration for this RW operation.
 * We currently always use memory registrations for iWarp RDMA READs, and
 * have a debug option to force usage of MRs.
 * XXX: In the future we can hopefully fine tune this based on HCA driver
 * input.
 */

Regardless of whether the HCA driver provides input, I think a chain of 30+
RDMA WRITE WRs isn't as efficient as 1 REG_MR + 1 WRITE + 1 INV_MR. Is it
unreasonable to just add some threshold in rw.c?

Also, I think rdma_rw_init_qp() does need some tweaks: It needs to take into
account the max sge depth, the max REG_MR depth, and the max SQ depth device
attributes/capabilities when sizing the SQ. However, if that computed depth
exceeds the device max, then the SQ will not be big enough to avoid
overflowing, and I believe ULPs should _always_ flow control their outgoing
WRs based on the SQ depth regardless. And perhaps rdma-rw should even avoid
overly deep SQs, just because they tend to inhibit scalability. E.g.,
allowing lots of shallow QPs vs. consuming all the device resources with very
deep QPs...

Steve.
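
To make the arithmetic in this thread concrete, here is a minimal userspace sketch in C, not kernel code and not what rw.c does today. It counts the WRs one I/O needs with and without MRs, applies a threshold of the kind suggested above, and clamps the SQ depth to a device maximum. Every name in it (struct dev_limits, wrs_plain(), wrs_with_mr(), should_use_mr(), sq_depth()) is hypothetical, and the chain limit and max SQ depth used with it are assumed values, not iw_cxgb4 attributes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device limits; these are illustrative fields, not the
 * kernel's struct ib_device_attr. */
struct dev_limits {
	size_t page_size;     /* bytes per page */
	size_t max_sge;       /* pages (SGEs) one WRITE WR can carry */
	size_t max_mr_pages;  /* pages one REG_MR can cover */
	size_t max_sq_depth;  /* device cap on send queue entries */
};

static size_t div_round_up(size_t n, size_t d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}

/* WRs for one I/O when each WRITE WR carries at most max_sge pages. */
static size_t wrs_plain(const struct dev_limits *dev, size_t io_bytes)
{
	size_t pages = div_round_up(io_bytes, dev->page_size);

	return div_round_up(pages, dev->max_sge);
}

/* WRs for one I/O via MRs: REG_MR + WRITE + INV_MR per registration. */
static size_t wrs_with_mr(const struct dev_limits *dev, size_t io_bytes)
{
	size_t pages = div_round_up(io_bytes, dev->page_size);

	return 3 * div_round_up(pages, dev->max_mr_pages);
}

/* Assumed threshold policy: switch to MRs only when the plain chain is
 * longer than chain_limit and the MR path is actually shorter. */
static int should_use_mr(const struct dev_limits *dev, size_t io_bytes,
			 size_t chain_limit)
{
	size_t plain = wrs_plain(dev, io_bytes);

	return plain > chain_limit && wrs_with_mr(dev, io_bytes) < plain;
}

/* SQ depth wanted for max_ios concurrent I/Os, clamped to the device max.
 * *ios_allowed receives how many I/Os fit in the clamped queue, i.e. the
 * limit a ULP would have to flow control against. */
static size_t sq_depth(const struct dev_limits *dev, size_t max_ios,
		       size_t wrs_per_io, size_t *ios_allowed)
{
	size_t want, depth;

	assert(wrs_per_io > 0);
	want = max_ios * wrs_per_io;
	depth = want < dev->max_sq_depth ? want : dev->max_sq_depth;
	*ios_allowed = depth / wrs_per_io;
	return depth;
}
```

With the numbers quoted in this thread (4KB pages, 4 SGEs per WRITE, 128 pages per MR), a 512KB I/O comes out at 32 WRs without MRs and 3 with them. The clamp in sq_depth() shows why flow control is still needed: once the wanted depth exceeds the device max, fewer I/Os fit than the ULP asked for.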