On Sat, Jan 06, 2024 at 10:12:17AM +0800, Ding Hui wrote: > On 2024/1/4 20:37, Jason Gunthorpe wrote: > > On Thu, Jan 04, 2024 at 02:48:14PM +0800, Shifeng Li wrote: > > > > > The root cause is that mad_client and cm_client may init concurrently > > > when devices_rwsem write semaphore is downgraded in enable_device_and_get() like: > > > > That can't be true, the module loader infrastructue ensures those two > > things are sequential. > > > > Please consider the sequence again and notice that: > > 1. We agree that dependencies ensure mad_client be registered before cm_client. > 2. But the mad_client.add() is not invoked in ib_register_client(), since > there is no DEVICE_REGISTERED device at that time. > Instead, it will be delayed until the device driver init (e.g. mlx5_core) > in enable_device_and_get(). > 3. The ib_cm and mlx5_core can be loaded concurrently, after setting DEVICE_REGISTERED > and downgrade_write(&devices_rwsem) in enable_device_and_get(), there is a chance > that cm_client.add() can be invoked before mad_client.add(). > > > T1(ib_core init) | T2(device driver init) | T3(ib_cm init) > --------------------------------------------------------------------------------------------------- > ib_register_client mad_client > assign_client_id > add clients CLIENT_REGISTERED > (with clients_rwsem write) > down_read(&devices_rwsem); > xa_for_each_marked (&devices, DEVICE_REGISTERED) > nop # no devices > up_read(&devices_rwsem); > > ib_register_device > enable_device_and_get > down_write(&devices_rwsem); > set DEVICE_REGISTERED > downgrade_write(&devices_rwsem); > ib_register_client cm_client > assign_client_id > add clients CLIENT_REGISTERED > (with clients_rwsem write) > down_read(&devices_rwsem); > xa_for_each_marked (&devices, DEVICE_REGISTERED) > add_client_context > down_write(&device->client_data_rwsem); > get CLIENT_DATA_REGISTERED > downgrade_write(&device->client_data_rwsem); > cm_client.add > cm_add_one > ib_register_mad_agent > ib_get_mad_port > __ib_get_mad_port return NULL! > set CLIENT_DATA_REGISTERED > up_read(&device->client_data_rwsem); > up_read(&devices_rwsem); > down_read(&clients_rwsem); > xa_for_each_marked (&clients, CLIENT_REGISTERED) > add_client_context [mad] > mad_client.add > add_client_context [cm] > nop # already CLIENT_DATA_REGISTERED > up_read(&clients_rwsem); > up_read(&devices_rwsem); Take the draft I sent previously and use down_write(&devices_rwsem) in ib_register_client() Jason