On 2024/1/5 22:19, Jason Gunthorpe wrote:
On Fri, Jan 05, 2024 at 04:15:18PM +0800, Shifeng Li wrote:
On 2024/1/4 20:37, Jason Gunthorpe wrote:
On Thu, Jan 04, 2024 at 02:48:14PM +0800, Shifeng Li wrote:
The root cause is that mad_client and cm_client may init concurrently
when devices_rwsem write semaphore is downgraded in enable_device_and_get() like:
That can't be true, the module loader infrastructue ensures those two
things are sequential.
I'm a bit confused how the module loader infrastructue ensures that mad_client.add() and
cm_client.add() are sequential. Could you explain in more detail
please?
ib_cm has a symbol dependency on ib_mad, so the module loader will not
allow ib_cm to start running until all its symbol dependencies have
completed loading.
I have found a method to reproduce the issue as follow:
1. modprobe -r ib_cm; modprobe -r ib_core; modprobe -r mlx5_ib;
2. Compile and replace ib_core with following patch;
3. modprobe ib_cm;
4. modprobe mlx5_ib;
diff -Narup ./drivers/infiniband/core/device.c ./drivers/infiniband/core/device.c.reproduce
--- ./drivers/infiniband/core/device.c 2024-01-15 11:14:08.063094430 +0800
+++ ./drivers/infiniband/core/device.c.reproduce 2024-01-15 11:16:22.577096953 +0800
@@ -64,6 +64,8 @@ struct workqueue_struct *ib_wq;
EXPORT_SYMBOL_GPL(ib_wq);
static struct workqueue_struct *ib_unreg_wq;
+int ib_cm_run_flag = 0;
+
/*
* Each of the three rwsem locks (devices, clients, client_data) protects the
* xarray of the same name. Specifically it allows the caller to assert that
@@ -1371,6 +1373,9 @@ static int enable_device_and_get(struct
*/
downgrade_write(&devices_rwsem);
+ ib_cm_run_flag = 1;
+ msleep(30000);
+
if (device->ops.enable_driver) {
ret = device->ops.enable_driver(device);
if (ret)
@@ -1843,6 +1848,12 @@ int ib_register_client(struct ib_client
if (ret)
return ret;
+ if (strncmp(client->name, "cm", strlen("cm")) == 0) {
+ while (!ib_cm_run_flag) {
+ cond_resched();
+ }
+ }
+
down_read(&devices_rwsem);
xa_for_each_marked (&devices, index, device, DEVICE_REGISTERED) {
ret = add_client_context(device, client);
We know that the ib_cm driver and mlx5_ib driver can load concurrently.
Yes, this is possible
Jason