On 7/18/23 10:13, Edward Srouji wrote: > > On 7/18/2023 5:08 AM, Bob Pearson wrote: >> External email: Use caution opening links or attachments >> >> >> On 3/23/23 21:38, Bob Pearson wrote: >>> >>> >>> -------- Forwarded Message -------- >>> Subject: pyverbs test_resize_cq fails in some cases >>> Date: Thu, 23 Mar 2023 21:37:28 -0500 >>> From: Bob Pearson <rpearsonhpe@xxxxxxxxx> >>> To: Edward Srouji <edwards@xxxxxxxxxx> >>> CC: Jason Gunthorpe <jgg@xxxxxxxxxx>, Zhu Yanjun <zyjzyj2000@xxxxxxxxx> >>> >>> Edward, >>> >>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases. >>> >>> The code is definitely racy: >>> >>> def test_resize_cq(self): >>> """ >>> Test resize CQ, start with specific value and then increase and decrease >>> the CQ size. The test also check bad flow of decrease the CQ size when >>> there are more completions on it than the new value. >>> """ >>> self.create_players(CQUDResources, cq_depth=3) >>> # Decrease the CQ size. >>> new_cq_size = 1 >>> try: >>> self.client.cq.resize(new_cq_size) >>> except PyverbsRDMAError as ex: >>> if ex.error_code == errno.EOPNOTSUPP: >>> raise unittest.SkipTest('Resize CQ is not supported') >>> raise ex >>> self.assertTrue(self.client.cq.cqe >= new_cq_size, >>> f'The actual CQ size ({self.client.cq.cqe}) is less ' >>> 'than guaranteed ({new_cq_size})') >>> >>> # Increase the CQ size. >>> new_cq_size = 7 >>> self.client.cq.resize(new_cq_size) >>> self.assertTrue(self.client.cq.cqe >= new_cq_size, >>> f'The actual CQ size ({self.client.cq.cqe}) is less ' >>> 'than guaranteed ({new_cq_size})') >>> >>> # Fill the CQ entries except one for avoid cq_overrun warnings. >>> send_wr, _ = u.get_send_elements(self.client, False) >>> ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port) >>> for i in range(self.client.cq.cqe - 1): >>> u.send(self.client, send_wr, ah=ah_client) >>> >>> ### This posts 6 send work requests but does not wait for them to complete >>> ### The following proceeds while the sends are executing in the background >>> ### and the test can fail. An easy fix is to add the line >>> time.sleep(1/1000) ### or maybe something a little larger but this works for me. >>> ### alternatively you could test the data at the destination. >>> >>> # Decrease the CQ size to less than the CQ unpolled entries. >>> new_cq_size = 1 >>> with self.assertRaises(PyverbsRDMAError) as ex: >>> self.client.cq.resize(new_cq_size) >>> self.assertEqual(ex.exception.error_code, errno.EINVAL) >>> >>> Bob >> Edward, >> >> This is showing up again. The software drivers are slower than the hardware drivers. >> Now that the send side logic in the rxe driver is handled by a workqueue and if it is >> scheduled instead of running inline this test will fail some percentage of the time. >> Especially when the test is run for the first time but 5-10% of the time after that. >> If you run the send side logic inline it is fast enough that the test doesn't fail but >> that doesn't allow the workqueue tasks to spread out on the available cores so for high >> QP count it is much better to schedule them. >> >> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and >> better is time.sleep(0.001) which works fine for this test and doesn't slow down the >> run time for the whole suite noticeably. > > I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which is required to wait for the port to become up - but it's not run by default until the user specifically configure an mlx5_vfio device). > > What if we poll the CQ for one completion, to make sure that the driver had the time to publish a CQE (which might be enough "time" for us)? > i.e. > > $ git diff > diff --git a/tests/test_cq.py b/tests/test_cq.py > index 98485e3f8..e17676304 100644 > --- a/tests/test_cq.py > +++ b/tests/test_cq.py > @@ -123,6 +123,7 @@ class CQTest(RDMATestCase): > ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port) > for i in range(self.client.cq.cqe - 1): > u.send(self.client, send_wr, ah=ah_client) > + u.poll_cq(self.client.cq) > > # Decrease the CQ size to less than the CQ unpolled entries. > new_cq_size = 1 > > Could you try to see if that solves the issue for you? > >> >> Bottom line this test case is racy. >> >> If you want I can submit a 2 line patch that does this. Or if you would that would be >> better. >> >> Bob Yes. May take a day or two to get to it but I will definitely try it. Thanks Bob