Re: Fwd: pyverbs test_resize_cq fails in some cases

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 7/18/2023 5:08 AM, Bob Pearson wrote:
External email: Use caution opening links or attachments


On 3/23/23 21:38, Bob Pearson wrote:


-------- Forwarded Message --------
Subject: pyverbs test_resize_cq fails in some cases
Date: Thu, 23 Mar 2023 21:37:28 -0500
From: Bob Pearson <rpearsonhpe@xxxxxxxxx>
To: Edward Srouji <edwards@xxxxxxxxxx>
CC: Jason Gunthorpe <jgg@xxxxxxxxxx>, Zhu Yanjun <zyjzyj2000@xxxxxxxxx>

Edward,

The pyverbs test: test_resize_cq fails for the rxe driver in some cases.

The code is definitely racy:

def test_resize_cq(self):
         """
         Test resize CQ, start with specific value and then increase and decrease
         the CQ size. The test also check bad flow of decrease the CQ size when
         there are more completions on it than the new value.
         """
         self.create_players(CQUDResources, cq_depth=3)
         # Decrease the CQ size.
         new_cq_size = 1
         try:
             self.client.cq.resize(new_cq_size)
         except PyverbsRDMAError as ex:
             if ex.error_code == errno.EOPNOTSUPP:
                 raise unittest.SkipTest('Resize CQ is not supported')
             raise ex
         self.assertTrue(self.client.cq.cqe >= new_cq_size,
                         f'The actual CQ size ({self.client.cq.cqe}) is less '
                         'than guaranteed ({new_cq_size})')

         # Increase the CQ size.
         new_cq_size = 7
         self.client.cq.resize(new_cq_size)
         self.assertTrue(self.client.cq.cqe >= new_cq_size,
                         f'The actual CQ size ({self.client.cq.cqe}) is less '
                         'than guaranteed ({new_cq_size})')

         # Fill the CQ entries except one for avoid cq_overrun warnings.
         send_wr, _ = u.get_send_elements(self.client, False)
         ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
         for i in range(self.client.cq.cqe - 1):
             u.send(self.client, send_wr, ah=ah_client)

       ### This posts 6 send work requests but does not wait for them to complete
       ### The following proceeds while the sends are executing in the background
       ### and the test can fail. An easy fix is to add the line
         time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
       ### alternatively you could test the data at the destination.

         # Decrease the CQ size to less than the CQ unpolled entries.
         new_cq_size = 1
         with self.assertRaises(PyverbsRDMAError) as ex:
             self.client.cq.resize(new_cq_size)
         self.assertEqual(ex.exception.error_code, errno.EINVAL)

Bob
Edward,

This is showing up again. The software drivers are slower than the hardware drivers.
Now that the send side logic in the rxe driver is handled by a workqueue and if it is
scheduled instead of running inline this test will fail some percentage of the time.
Especially when the test is run for the first time but 5-10% of the time after that.
If you run the send side logic inline it is fast enough that the test doesn't fail but
that doesn't allow the workqueue tasks to spread out on the available cores so for high
QP count it is much better to schedule them.

There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
better is time.sleep(0.001) which works fine for this test and doesn't slow down the
run time for the whole suite noticeably.

I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which is required to wait for the port to become up - but it's not run by default until the user specifically configure an mlx5_vfio device).

What if we poll the CQ for one completion, to make sure that the driver had the time to publish a CQE (which might be enough "time" for us)?
i.e.

$ git diff
diff --git a/tests/test_cq.py b/tests/test_cq.py
index 98485e3f8..e17676304 100644
--- a/tests/test_cq.py
+++ b/tests/test_cq.py
@@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
         ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
         for i in range(self.client.cq.cqe - 1):
             u.send(self.client, send_wr, ah=ah_client)
+        u.poll_cq(self.client.cq)

         # Decrease the CQ size to less than the CQ unpolled entries.
         new_cq_size = 1

Could you try to see if that solves the issue for you?


Bottom line this test case is racy.

If you want I can submit a 2 line patch that does this. Or if you would that would be
better.

Bob



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux