> On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
>> In the case of a small NVMe-oF queue size (<32) we may enter a
>> deadlock: IB send completions are only signaled every 32 posts, so
>> the send queue fills up before a completion ever arrives.
>>
>> The error is seen as (using mlx5):
>> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
>> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>>
>> This patch changes the signaling so that it depends on the queue
>> depth. The hardcoded magic value has been removed completely.
>>
>> Signed-off-by: Marta Rybczynska <marta.rybczynska@xxxxxxxxx>
>> Signed-off-by: Samuel Jones <sjones@xxxxxxxxx>
>> ---
>> Changes from v1:
>> * signal by queue size/2, remove hardcoded 32
>> * support queue depth of 1
>>
>>  drivers/nvme/host/rdma.c | 17 +++++++++++++----
>>  1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 47a479f..4de1b92 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>>  		nvme_rdma_wr_error(cq, wc, "SEND");
>>  }
>>
>> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +{
>> +	int sig_limit;
>> +
>> +	/* We signal completion every queue depth/2 and also
>> +	 * handle the case of a possible device with queue_depth=1,
>> +	 * where we would need to signal every message.
>> +	 */
>> +	sig_limit = max(queue->queue_size / 2, 1);
>> +	return (++queue->sig_count % sig_limit) == 0;
>> +}
>> +
>>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>>  		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
>>  		struct ib_send_wr *first, bool flush)
>> @@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>>  	 * Would have been way to obvious to handle this in hardware or
>>  	 * at least the RDMA stack..
>>  	 *
>> -	 * This messy and racy code sniplet is copy and pasted from the iSER
>> -	 * initiator, and the magic '32' comes from there as well.
>> -	 *
>>  	 * Always signal the flushes. The magic request used for the flush
>>  	 * sequencer is not allocated in our driver's tagset and it's
>>  	 * triggered to be freed by blk_cleanup_queue(). So we need to
>> @@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>>  	 * embedded in request's payload, is not freed when __ib_process_cq()
>>  	 * calls wr_cqe->done().
>>  	 */
>> -	if ((++queue->sig_count % 32) == 0 || flush)
>> +	if (nvme_rdma_queue_sig_limit(queue) || flush)
>>  		wr.send_flags |= IB_SEND_SIGNALED;
>>
>>  	if (first)
>
> Hello Marta,
>
> The approach of this patch is suboptimal from a performance point of
> view. If the number of WRs submitted since the last signaled WR were
> tracked in a member variable, that would allow getting rid of the
> (relatively slow) division operation.

Hello Bart,

I think that we can remove the division (the modulo by sig_limit).
sig_count is a u8, so it is already the kind of member variable you
propose. It isn't used anywhere else, as far as I can see, so we can
change the snippet to count up to the signaling point. It will give
something like:

static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
{
	int sig_limit;

	/* We signal completion every queue depth/2 and also
	 * handle the case of a possible device with queue_depth=1,
	 * where we would need to signal every message.
	 */
	sig_limit = max(queue->queue_size / 2, 1);

	queue->sig_count++;
	if (queue->sig_count < sig_limit)
		return 0;
	queue->sig_count = 0;
	return 1;
}

Do you like it better?

Marta
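For completeness, here is a minimal userspace sketch (not part of the
patch) that exercises the counting logic above across a few queue
depths, including depth 1. fake_queue, max_int and queue_sig_limit are
illustrative stand-ins for the kernel's struct nvme_rdma_queue, max()
and the proposed helper; it checks that exactly one send in every
max(depth / 2, 1) would be signaled:

#include <assert.h>
#include <stdio.h>

/* Illustrative stand-in for struct nvme_rdma_queue: only the two
 * fields the signaling logic touches (sig_count is a u8 there). */
struct fake_queue {
	int queue_size;
	unsigned char sig_count;
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Counter-based variant from the reply above: returns 1 (signal)
 * every max(queue_size / 2, 1) calls, without a modulo. */
static int queue_sig_limit(struct fake_queue *queue)
{
	int sig_limit = max_int(queue->queue_size / 2, 1);

	queue->sig_count++;
	if (queue->sig_count < sig_limit)
		return 0;
	queue->sig_count = 0;
	return 1;
}

int main(void)
{
	int depths[] = { 1, 2, 7, 32, 128 };
	unsigned int i;

	for (i = 0; i < sizeof(depths) / sizeof(depths[0]); i++) {
		struct fake_queue q = { depths[i], 0 };
		int limit = max_int(depths[i] / 2, 1);
		int posts = 10 * depths[i];
		int n, signaled = 0;

		/* Post "sends" and count how many would be signaled. */
		for (n = 0; n < posts; n++)
			signaled += queue_sig_limit(&q);

		assert(signaled == posts / limit);
		printf("depth %3d: one signal every %d sends\n",
		       depths[i], limit);
	}
	return 0;
}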