On Fri, Apr 23, 2021 at 09:14:21PM +0800, Mark Zhang wrote: > > > On 4/23/2021 3:34 AM, Jason Gunthorpe wrote: > > On Wed, Apr 21, 2021 at 02:40:37PM +0300, Leon Romanovsky wrote: > > > @@ -4396,6 +4439,14 @@ static void cm_remove_one(struct ib_device *ib_device, void *client_data) > > > cm_dev->going_down = 1; > > > spin_unlock_irq(&cm.lock); > > > + list_for_each_entry_safe(cm_id_priv, tmp, > > > + &cm_dev->cm_id_priv_list, cm_dev_list) { > > > + if (!list_empty(&cm_id_priv->cm_dev_list)) > > > + list_del(&cm_id_priv->cm_dev_list); > > > + cm_id_priv->av.port = NULL; > > > + cm_id_priv->alt_av.port = NULL; > > > + } > > > > Ugh, this is in the wrong order, it has to be after the work queue > > flush.. > > > > Hurm, I didn't see an easy way to fix it up, but I did think of a much > > better design! > > > > Generally speaking all we need is the memory of the cm_dev and port to > > remain active, we don't need to block or fence with cm_remove_one(), > > so just stick a memory kref on this thing and keep the memory. The > > only things that needs to seralize with cm_remove_one() are on the > > workqueue or take a spinlock (eg because they touch mad_agent) > > > > Try this, I didn't finish every detail, applies on top of your series, > > but you'll need to reflow it into new commits: > > Thanks Jason, I think we still need a rwlock to protect "av->port"? It is > modified and cleared by cm_set_av_port() and read in many places. Hum.. This is a real mess. It looks to me like any access to the av->port should always be protected by the cm_id_priv->lock Most already are, but the sets are wrong and a couple readers are wrong Set reverse call chains: cm_init_av_for_lap() cm_lap_handler(work) (ok) cm_init_av_for_response() cm_req_handler(work) (OK, cm_id_priv is on stack) cm_sidr_req_handler(work) (OK, cm_id_priv is on stack) cm_init_av_by_path() cm_req_handler(work) (OK, cm_id_priv is on stack) cm_lap_handler(work) (OK) ib_send_cm_req() (not locked) cma_connect_ib() rdma_connect_locked() [..] ipoib_cm_send_req() srp_send_req() srp_connect_ch() [..] ib_send_cm_sidr_req() (not locked) cma_resolve_ib_udp() rdma_connect_locked() And cm_destroy_id() (locked) And read reverse call chains: cm_alloc_msg() ib_send_cm_req() (not locked) ib_send_cm_rep() (OK) ib_send_cm_rtu() (OK) cm_send_dreq_locked() (OK) cm_send_drep_locked() (OK) cm_send_rej_locked() (OK) ib_send_cm_mra() (OK) ib_send_cm_sidr_req() (not locked) cm_send_sidr_rep_locked() (OK) cm_form_tid() cm_format_req() ib_send_cm_req() (sort of OK) cm_format_dreq() cm_send_dreq_locked (OK) cm_format_req() ib_send_cm_req() (sort of OK) cm_format_req_event() cm_req_handler() (OK, cm_id_priv is on stack) cm_format_rep() ib_send_cm_rep() (OK) cm_rep_handler(work) (OK) cm_establish_handler(work) (OK) cm_rtu_handler(work) (OK) cm_send_dreq_locked() (OK) cm_dreq_handler(work) (OK) cm_drep_handler(work) (OK) cm_rej_handler(work) (OK) cm_mra_handler(work) (OK) cm_apr_handler(work) (OK) cm_sidr_rep_handler(work) (OK) cm_init_qp_init_attr() (OK) cm_init_qp_rtr_attr() (OK) cm_init_qp_rts_attr() (OK) So.. That leaves these functions that are not obviously locked correctly: ib_send_cm_req() ib_send_cm_sidr_req() And the way their locking expects to work is basically because they expect that there are not parallel touches to the cm_id - however I'm doubtful this is completely true. So no new lock needed, but something should be done about the above two functions, and this should be documented Jason