在 2022/5/14 2:22, Bob Pearson 写道:
On 5/13/22 08:38, Yanjun Zhu wrote:
在 2022/5/13 10:40, Yanjun Zhu 写道:
在 2022/5/13 3:49, Bob Pearson 写道:
Currently the completer tasklet when it sets the retransmit timer or the
nak timer sets the same flag (qp->req.need_retry) so that if either
timer fires it will attempt to perform a retry flow on the send queue.
This has the effect of responding to an RNR NAK at the first retransmit
timer event which does not allow for the requested rnr timeout.
This patch adds a new flag (qp->req.need_rnr_timeout) which, if set,
prevents a retry flow until the rnr nak timer fires.
This patch fixes rnr retry errors which can be observed by running the
pyverbs test suite 50-100X. With this patch applied they do not occur.
Do you mean that running run_tests.py for 50-100times can reproduce this bug? I want to reproduce this problem.
Running run_tests.py for 50-100times, I can not reproduce this problem
I went back and looked. The occasionally failing test case is:
test_rdmacm_async_traffic_external_qp
This test is normally skipped for rxe because of a hack in <path to rdma-core>/tests/base.py. I sent you
a patch that enables these tests a few days ago.
What is the patch? Can you send me it again?
And what is the symptoms when this problem occurs?
I want to check if I can reproduce this problem.
Thanks and Regards,
Zhu Yanjun
It randomly causes an rnr retry about 1/3 of the time. Most of these are repaired by the retry timer
going off. A few ~1-2% are not.
Bob
Zhu Yanjun
Thanks a lot.
Zhu Yanjun
Link: https://lore.kernel.org/linux-rdma/a8287823-1408-4273-bc22-99a0678db640@xxxxxxxxx/
Fixes: 8700e3e7c485 ("Soft RoCE (RXE) - The software RoCE driver")
Signed-off-by: Bob Pearson <rpearsonhpe@xxxxxxxxx>
---
drivers/infiniband/sw/rxe/rxe_comp.c | 4 +---
drivers/infiniband/sw/rxe/rxe_qp.c | 1 +
drivers/infiniband/sw/rxe/rxe_req.c | 6 ++++--
drivers/infiniband/sw/rxe/rxe_verbs.h | 1 +
4 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 138b3e7d3a5f..bc668cb211b1 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -733,9 +733,7 @@ int rxe_completer(void *arg)
if (qp->comp.rnr_retry != 7)
qp->comp.rnr_retry--;
- qp->req.need_retry = 1;
- pr_debug("qp#%d set rnr nak timer\n",
- qp_num(qp));
+ qp->req.need_rnr_timeout = 1;
mod_timer(&qp->rnr_nak_timer,
jiffies + rnrnak_jiffies(aeth_syn(pkt)
& ~AETH_TYPE_MASK));
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 62acf890af6c..1c962468714e 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -513,6 +513,7 @@ static void rxe_qp_reset(struct rxe_qp *qp)
atomic_set(&qp->ssn, 0);
qp->req.opcode = -1;
qp->req.need_retry = 0;
+ qp->req.need_rnr_timeout = 0;
qp->req.noack_pkts = 0;
qp->resp.msn = 0;
qp->resp.opcode = -1;
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index ae5fbc79dd5c..770ae4279f73 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -103,7 +103,8 @@ void rnr_nak_timer(struct timer_list *t)
{
struct rxe_qp *qp = from_timer(qp, t, rnr_nak_timer);
- pr_debug("qp#%d rnr nak timer fired\n", qp_num(qp));
+ qp->req.need_retry = 1;
+ qp->req.need_rnr_timeout = 0;
rxe_run_task(&qp->req.task, 1);
}
@@ -624,10 +625,11 @@ int rxe_requester(void *arg)
qp->req.need_rd_atomic = 0;
qp->req.wait_psn = 0;
qp->req.need_retry = 0;
+ qp->req.need_rnr_timeout = 0;
goto exit;
}
- if (unlikely(qp->req.need_retry)) {
+ if (unlikely(qp->req.need_retry && !qp->req.need_rnr_timeout)) {
req_retry(qp);
qp->req.need_retry = 0;
}
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index e7eff1ca75e9..ab3186478c3f 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -123,6 +123,7 @@ struct rxe_req_info {
int need_rd_atomic;
int wait_psn;
int need_retry;
+ int need_rnr_timeout;
int noack_pkts;
struct rxe_task task;
};
base-commit: c5eb0a61238dd6faf37f58c9ce61c9980aaffd7a