On 4/23/2021 10:24 PM, Jason Gunthorpe wrote:
On Fri, Apr 23, 2021 at 09:14:21PM +0800, Mark Zhang wrote:
On 4/23/2021 3:34 AM, Jason Gunthorpe wrote:
On Wed, Apr 21, 2021 at 02:40:37PM +0300, Leon Romanovsky wrote:
@@ -4396,6 +4439,14 @@ static void cm_remove_one(struct ib_device *ib_device, void *client_data)
cm_dev->going_down = 1;
spin_unlock_irq(&cm.lock);
+ list_for_each_entry_safe(cm_id_priv, tmp,
+ &cm_dev->cm_id_priv_list, cm_dev_list) {
+ if (!list_empty(&cm_id_priv->cm_dev_list))
+ list_del(&cm_id_priv->cm_dev_list);
+ cm_id_priv->av.port = NULL;
+ cm_id_priv->alt_av.port = NULL;
+ }
Ugh, this is in the wrong order, it has to be after the work queue
flush..
Hurm, I didn't see an easy way to fix it up, but I did think of a much
better design!
Generally speaking all we need is the memory of the cm_dev and port to
remain active, we don't need to block or fence with cm_remove_one(),
so just stick a memory kref on this thing and keep the memory. The
only things that needs to seralize with cm_remove_one() are on the
workqueue or take a spinlock (eg because they touch mad_agent)
Try this, I didn't finish every detail, applies on top of your series,
but you'll need to reflow it into new commits:
Thanks Jason, I think we still need a rwlock to protect "av->port"? It is
modified and cleared by cm_set_av_port() and read in many places.
Hum..
This is a real mess.
It looks to me like any access to the av->port should always be
protected by the cm_id_priv->lock
Most already are, but the sets are wrong and a couple readers are wrong
Set reverse call chains:
cm_init_av_for_lap()
cm_lap_handler(work) (ok)
cm_init_av_for_response()
cm_req_handler(work) (OK, cm_id_priv is on stack)
cm_sidr_req_handler(work) (OK, cm_id_priv is on stack)
cm_init_av_by_path()
cm_req_handler(work) (OK, cm_id_priv is on stack)
cm_lap_handler(work) (OK)
ib_send_cm_req() (not locked)
cma_connect_ib()
rdma_connect_locked()
[..]
ipoib_cm_send_req()
srp_send_req()
srp_connect_ch()
[..]
ib_send_cm_sidr_req() (not locked)
cma_resolve_ib_udp()
rdma_connect_locked()
Both cm_init_av_for_lap() and cm_init_av_by_path() might sleep (the last
patch of this series tries to fix this issue), they cannot be called
with cm_id_priv->lock
And
cm_destroy_id() (locked)
And read reverse call chains:
cm_alloc_msg()
ib_send_cm_req() (not locked)
ib_send_cm_rep() (OK)
ib_send_cm_rtu() (OK)
cm_send_dreq_locked() (OK)
cm_send_drep_locked() (OK)
cm_send_rej_locked() (OK)
ib_send_cm_mra() (OK)
ib_send_cm_sidr_req() (not locked)
cm_send_sidr_rep_locked() (OK)
cm_form_tid()
cm_format_req()
ib_send_cm_req() (sort of OK)
cm_format_dreq()
cm_send_dreq_locked (OK)
cm_format_req()
ib_send_cm_req() (sort of OK)
cm_format_req_event()
cm_req_handler() (OK, cm_id_priv is on stack)
cm_format_rep()
ib_send_cm_rep() (OK)
cm_rep_handler(work) (OK)
cm_establish_handler(work) (OK)
cm_rtu_handler(work) (OK)
cm_send_dreq_locked() (OK)
cm_dreq_handler(work) (OK)
cm_drep_handler(work) (OK)
cm_rej_handler(work) (OK)
cm_mra_handler(work) (OK)
cm_apr_handler(work) (OK)
cm_sidr_rep_handler(work) (OK)
cm_init_qp_init_attr() (OK)
cm_init_qp_rtr_attr() (OK)
cm_init_qp_rts_attr() (OK)
So.. That leaves these functions that are not obviously locked
correctly:
ib_send_cm_req()
ib_send_cm_sidr_req()
And the way their locking expects to work is basically because they
expect that there are not parallel touches to the cm_id - however I'm
doubtful this is completely true.
So no new lock needed, but something should be done about the above
two functions, and this should be documented
Jason