Re: Fwd: pyverbs test_resize_cq fails in some cases

Bob Pearson <rpearsonhpe@xxxxxxxxx> · Wed, 19 Jul 2023 11:14:34 -0500

On 7/18/23 10:13, Edward Srouji wrote:
> 
> On 7/18/2023 5:08 AM, Bob Pearson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 3/23/23 21:38, Bob Pearson wrote:
>>>
>>>
>>> -------- Forwarded Message --------
>>> Subject: pyverbs test_resize_cq fails in some cases
>>> Date: Thu, 23 Mar 2023 21:37:28 -0500
>>> From: Bob Pearson <rpearsonhpe@xxxxxxxxx>
>>> To: Edward Srouji <edwards@xxxxxxxxxx>
>>> CC: Jason Gunthorpe <jgg@xxxxxxxxxx>, Zhu Yanjun <zyjzyj2000@xxxxxxxxx>
>>>
>>> Edward,
>>>
>>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
>>>
>>> The code is definitely racy:
>>>
>>> def test_resize_cq(self):
>>>          """
>>>          Test resize CQ, start with specific value and then increase and decrease
>>>          the CQ size. The test also check bad flow of decrease the CQ size when
>>>          there are more completions on it than the new value.
>>>          """
>>>          self.create_players(CQUDResources, cq_depth=3)
>>>          # Decrease the CQ size.
>>>          new_cq_size = 1
>>>          try:
>>>              self.client.cq.resize(new_cq_size)
>>>          except PyverbsRDMAError as ex:
>>>              if ex.error_code == errno.EOPNOTSUPP:
>>>                  raise unittest.SkipTest('Resize CQ is not supported')
>>>              raise ex
>>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>>                          'than guaranteed ({new_cq_size})')
>>>
>>>          # Increase the CQ size.
>>>          new_cq_size = 7
>>>          self.client.cq.resize(new_cq_size)
>>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>>                          'than guaranteed ({new_cq_size})')
>>>
>>>          # Fill the CQ entries except one for avoid cq_overrun warnings.
>>>          send_wr, _ = u.get_send_elements(self.client, False)
>>>          ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>>>          for i in range(self.client.cq.cqe - 1):
>>>              u.send(self.client, send_wr, ah=ah_client)
>>>
>>>        ### This posts 6 send work requests but does not wait for them to complete
>>>        ### The following proceeds while the sends are executing in the background
>>>        ### and the test can fail. An easy fix is to add the line
>>>          time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
>>>        ### alternatively you could test the data at the destination.
>>>
>>>          # Decrease the CQ size to less than the CQ unpolled entries.
>>>          new_cq_size = 1
>>>          with self.assertRaises(PyverbsRDMAError) as ex:
>>>              self.client.cq.resize(new_cq_size)
>>>          self.assertEqual(ex.exception.error_code, errno.EINVAL)
>>>
>>> Bob
>> Edward,
>>
>> This is showing up again. The software drivers are slower than the hardware drivers.
>> Now that the send side logic in the rxe driver is handled by a workqueue and if it is
>> scheduled instead of running inline this test will fail some percentage of the time.
>> Especially when the test is run for the first time but 5-10% of the time after that.
>> If you run the send side logic inline it is fast enough that the test doesn't fail but
>> that doesn't allow the workqueue tasks to spread out on the available cores so for high
>> QP count it is much better to schedule them.
>>
>> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
>> better is time.sleep(0.001) which works fine for this test and doesn't slow down the
>> run time for the whole suite noticeably.
> 
> I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which is required to wait for the port to become up - but it's not run by default until the user specifically configure an mlx5_vfio device).
> 
> What if we poll the CQ for one completion, to make sure that the driver had the time to publish a CQE (which might be enough "time" for us)?
> i.e.
> 
> $ git diff
> diff --git a/tests/test_cq.py b/tests/test_cq.py
> index 98485e3f8..e17676304 100644
> --- a/tests/test_cq.py
> +++ b/tests/test_cq.py
> @@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
>          ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>          for i in range(self.client.cq.cqe - 1):
>              u.send(self.client, send_wr, ah=ah_client)
> +        u.poll_cq(self.client.cq)
> 
>          # Decrease the CQ size to less than the CQ unpolled entries.
>          new_cq_size = 1
> 
> Could you try to see if that solves the issue for you?
> 
>>
>> Bottom line this test case is racy.
>>
>> If you want I can submit a 2 line patch that does this. Or if you would that would be
>> better.
>>
>> Bob

Yes. May take a day or two to get to it but I will definitely try it.

Thanks

Bob