Re: [PATCH for-rc] RDMA/rxe: Fix rnr retry behavior

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/13/22 21:30, Yanjun Zhu wrote:
> 
> 在 2022/5/14 2:22, Bob Pearson 写道:
>> On 5/13/22 08:38, Yanjun Zhu wrote:
>>>
>>> 在 2022/5/13 10:40, Yanjun Zhu 写道:
>>>> 在 2022/5/13 3:49, Bob Pearson 写道:
>>>>> Currently the completer tasklet when it sets the retransmit timer or the
>>>>> nak timer sets the same flag (qp->req.need_retry) so that if either
>>>>> timer fires it will attempt to perform a retry flow on the send queue.
>>>>> This has the effect of responding to an RNR NAK at the first retransmit
>>>>> timer event which does not allow for the requested rnr timeout.
>>>>>
>>>>> This patch adds a new flag (qp->req.need_rnr_timeout) which, if set,
>>>>> prevents a retry flow until the rnr nak timer fires.
>>>>>
>>>>> This patch fixes rnr retry errors which can be observed by running the
>>>>> pyverbs test suite 50-100X. With this patch applied they do not occur.
>>>> Do you mean that running run_tests.py for 50-100times can reproduce this bug? I want to reproduce this problem.
>>> Running run_tests.py for 50-100times, I can not reproduce this problem
>> I went back and looked. The occasionally failing test case is:
>>
>>     test_rdmacm_async_traffic_external_qp
>>
>> This test is normally skipped for rxe because of a hack in <path to rdma-core>/tests/base.py. I sent you
>> a patch that enables these tests a few days ago.
> 
> What is the patch? Can you send me it again?

diff --git a/tests/base.py b/tests/base.py

index 93568c7f..03b5601e 100644

--- a/tests/base.py

+++ b/tests/base.py

@@ -245,10 +245,11 @@ class RDMATestCase(unittest.TestCase):

         self._add_gids_per_port(ctx, dev, self.ib_port)

 

     def _get_ip_mac(self, dev, port, idx):

-        if not os.path.exists('/sys/class/infiniband/{}/device/net/'.format(dev)):

-            self.args.append([dev, port, idx, None, None])

-            return

-        net_name = self.get_net_name(dev)

+        #if not os.path.exists('/sys/class/infiniband/{}/device/net/'.format(dev)):

+            #self.args.append([dev, port, idx, None, None])

+            #return

+        #net_name = self.get_net_name(dev)

+        net_name = "enp0s3"

         try:

             ip_addr, mac_addr = self.get_ip_mac_address(net_name)

         except (KeyError, IndexError):


Save this to a file (e.g. net_name.patch) and type git apply net_name.patch.

Before doing this change "enp0s3" to the correct name for the Ethernet adapter you are
using in your system. tests/base.py assumes that all roce devices have a file with a path

/sys/class/infiniband/rxe0/device/net/

But this is not always the case and rxe does not have that path. This breaks all the
rdmacm test cases in pyverbs.

> 
> And what is the symptoms when this problem occurs?

When I run

# sudo ./build/bin/run_tests -k test_rdmacm_async_traffic_external_qp several times

I will eventually see an rnr timeout error. You may not have this experience but you
can add print statements to the retransmit_timer and rnr_timer routines in rxe_comp.c
and rxe_req.c respectively. You should see that without the below patch you will
never see the rnr_timer fire and you will see lots of retransmit_timer fires.

The current code responds to the retry timer by initiating a send queue retry and
the test will usually pass if it has sent an RNR NAK and tear down the rnr timer
before it have a chance to fire.

Bob
> 
> I want to check if I can reproduce this problem.
> 
> 
> Thanks and Regards,
> 
> Zhu Yanjun

Thanks

Bob
> 
>>
>> It randomly causes an rnr retry about 1/3 of the time. Most of these are repaired by the retry timer
>> going off. A few ~1-2% are not.
>>
>> Bob
>>> Zhu Yanjun
>>>
>>>> Thanks a lot.
>>>> Zhu Yanjun
>>>>
>>>>> Link: https://lore.kernel.org/linux-rdma/a8287823-1408-4273-bc22-99a0678db640@xxxxxxxxx/
>>>>> Fixes: 8700e3e7c485 ("Soft RoCE (RXE) - The software RoCE driver")
>>>>> Signed-off-by: Bob Pearson <rpearsonhpe@xxxxxxxxx>
>>>>> ---
>>>>>    drivers/infiniband/sw/rxe/rxe_comp.c  | 4 +---
>>>>>    drivers/infiniband/sw/rxe/rxe_qp.c    | 1 +
>>>>>    drivers/infiniband/sw/rxe/rxe_req.c   | 6 ++++--
>>>>>    drivers/infiniband/sw/rxe/rxe_verbs.h | 1 +
>>>>>    4 files changed, 7 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
>>>>> index 138b3e7d3a5f..bc668cb211b1 100644
>>>>> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
>>>>> @@ -733,9 +733,7 @@ int rxe_completer(void *arg)
>>>>>                    if (qp->comp.rnr_retry != 7)
>>>>>                        qp->comp.rnr_retry--;
>>>>> -                qp->req.need_retry = 1;
>>>>> -                pr_debug("qp#%d set rnr nak timer\n",
>>>>> -                     qp_num(qp));
>>>>> +                qp->req.need_rnr_timeout = 1;
>>>>>                    mod_timer(&qp->rnr_nak_timer,
>>>>>                          jiffies + rnrnak_jiffies(aeth_syn(pkt)
>>>>>                            & ~AETH_TYPE_MASK));
>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
>>>>> index 62acf890af6c..1c962468714e 100644
>>>>> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
>>>>> @@ -513,6 +513,7 @@ static void rxe_qp_reset(struct rxe_qp *qp)
>>>>>        atomic_set(&qp->ssn, 0);
>>>>>        qp->req.opcode = -1;
>>>>>        qp->req.need_retry = 0;
>>>>> +    qp->req.need_rnr_timeout = 0;
>>>>>        qp->req.noack_pkts = 0;
>>>>>        qp->resp.msn = 0;
>>>>>        qp->resp.opcode = -1;
>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
>>>>> index ae5fbc79dd5c..770ae4279f73 100644
>>>>> --- a/drivers/infiniband/sw/rxe/rxe_req.c
>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
>>>>> @@ -103,7 +103,8 @@ void rnr_nak_timer(struct timer_list *t)
>>>>>    {
>>>>>        struct rxe_qp *qp = from_timer(qp, t, rnr_nak_timer);
>>>>> -    pr_debug("qp#%d rnr nak timer fired\n", qp_num(qp));
>>>>> +    qp->req.need_retry = 1;
>>>>> +    qp->req.need_rnr_timeout = 0;
>>>>>        rxe_run_task(&qp->req.task, 1);
>>>>>    }
>>>>> @@ -624,10 +625,11 @@ int rxe_requester(void *arg)
>>>>>            qp->req.need_rd_atomic = 0;
>>>>>            qp->req.wait_psn = 0;
>>>>>            qp->req.need_retry = 0;
>>>>> +        qp->req.need_rnr_timeout = 0;
>>>>>            goto exit;
>>>>>        }
>>>>> -    if (unlikely(qp->req.need_retry)) {
>>>>> +    if (unlikely(qp->req.need_retry && !qp->req.need_rnr_timeout)) {
>>>>>            req_retry(qp);
>>>>>            qp->req.need_retry = 0;
>>>>>        }
>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
>>>>> index e7eff1ca75e9..ab3186478c3f 100644
>>>>> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
>>>>> @@ -123,6 +123,7 @@ struct rxe_req_info {
>>>>>        int            need_rd_atomic;
>>>>>        int            wait_psn;
>>>>>        int            need_retry;
>>>>> +    int            need_rnr_timeout;
>>>>>        int            noack_pkts;
>>>>>        struct rxe_task        task;
>>>>>    };
>>>>>
>>>>> base-commit: c5eb0a61238dd6faf37f58c9ce61c9980aaffd7a




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux