On 5/13/22 08:38, Yanjun Zhu wrote: > > > 在 2022/5/13 10:40, Yanjun Zhu 写道: >> 在 2022/5/13 3:49, Bob Pearson 写道: >>> Currently the completer tasklet when it sets the retransmit timer or the >>> nak timer sets the same flag (qp->req.need_retry) so that if either >>> timer fires it will attempt to perform a retry flow on the send queue. >>> This has the effect of responding to an RNR NAK at the first retransmit >>> timer event which does not allow for the requested rnr timeout. >>> >>> This patch adds a new flag (qp->req.need_rnr_timeout) which, if set, >>> prevents a retry flow until the rnr nak timer fires. >>> >>> This patch fixes rnr retry errors which can be observed by running the >>> pyverbs test suite 50-100X. With this patch applied they do not occur. >> >> Do you mean that running run_tests.py for 50-100times can reproduce this bug? I want to reproduce this problem. > > Running run_tests.py for 50-100times, I can not reproduce this problem I went back and looked. The occasionally failing test case is: test_rdmacm_async_traffic_external_qp This test is normally skipped for rxe because of a hack in <path to rdma-core>/tests/base.py. I sent you a patch that enables these tests a few days ago. It randomly causes an rnr retry about 1/3 of the time. Most of these are repaired by the retry timer going off. A few ~1-2% are not. Bob > > Zhu Yanjun > >> >> Thanks a lot. >> Zhu Yanjun >> >>> >>> Link: https://lore.kernel.org/linux-rdma/a8287823-1408-4273-bc22-99a0678db640@xxxxxxxxx/ >>> Fixes: 8700e3e7c485 ("Soft RoCE (RXE) - The software RoCE driver") >>> Signed-off-by: Bob Pearson <rpearsonhpe@xxxxxxxxx> >>> --- >>> drivers/infiniband/sw/rxe/rxe_comp.c | 4 +--- >>> drivers/infiniband/sw/rxe/rxe_qp.c | 1 + >>> drivers/infiniband/sw/rxe/rxe_req.c | 6 ++++-- >>> drivers/infiniband/sw/rxe/rxe_verbs.h | 1 + >>> 4 files changed, 7 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c >>> index 138b3e7d3a5f..bc668cb211b1 100644 >>> --- a/drivers/infiniband/sw/rxe/rxe_comp.c >>> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c >>> @@ -733,9 +733,7 @@ int rxe_completer(void *arg) >>> if (qp->comp.rnr_retry != 7) >>> qp->comp.rnr_retry--; >>> - qp->req.need_retry = 1; >>> - pr_debug("qp#%d set rnr nak timer\n", >>> - qp_num(qp)); >>> + qp->req.need_rnr_timeout = 1; >>> mod_timer(&qp->rnr_nak_timer, >>> jiffies + rnrnak_jiffies(aeth_syn(pkt) >>> & ~AETH_TYPE_MASK)); >>> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c >>> index 62acf890af6c..1c962468714e 100644 >>> --- a/drivers/infiniband/sw/rxe/rxe_qp.c >>> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c >>> @@ -513,6 +513,7 @@ static void rxe_qp_reset(struct rxe_qp *qp) >>> atomic_set(&qp->ssn, 0); >>> qp->req.opcode = -1; >>> qp->req.need_retry = 0; >>> + qp->req.need_rnr_timeout = 0; >>> qp->req.noack_pkts = 0; >>> qp->resp.msn = 0; >>> qp->resp.opcode = -1; >>> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c >>> index ae5fbc79dd5c..770ae4279f73 100644 >>> --- a/drivers/infiniband/sw/rxe/rxe_req.c >>> +++ b/drivers/infiniband/sw/rxe/rxe_req.c >>> @@ -103,7 +103,8 @@ void rnr_nak_timer(struct timer_list *t) >>> { >>> struct rxe_qp *qp = from_timer(qp, t, rnr_nak_timer); >>> - pr_debug("qp#%d rnr nak timer fired\n", qp_num(qp)); >>> + qp->req.need_retry = 1; >>> + qp->req.need_rnr_timeout = 0; >>> rxe_run_task(&qp->req.task, 1); >>> } >>> @@ -624,10 +625,11 @@ int rxe_requester(void *arg) >>> qp->req.need_rd_atomic = 0; >>> qp->req.wait_psn = 0; >>> qp->req.need_retry = 0; >>> + qp->req.need_rnr_timeout = 0; >>> goto exit; >>> } >>> - if (unlikely(qp->req.need_retry)) { >>> + if (unlikely(qp->req.need_retry && !qp->req.need_rnr_timeout)) { >>> req_retry(qp); >>> qp->req.need_retry = 0; >>> } >>> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h >>> index e7eff1ca75e9..ab3186478c3f 100644 >>> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h >>> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h >>> @@ -123,6 +123,7 @@ struct rxe_req_info { >>> int need_rd_atomic; >>> int wait_psn; >>> int need_retry; >>> + int need_rnr_timeout; >>> int noack_pkts; >>> struct rxe_task task; >>> }; >>> >>> base-commit: c5eb0a61238dd6faf37f58c9ce61c9980aaffd7a >>