Re: [PATCH rdma-next v2 7/9] IB/cm: Clear all associated AV's ports when remove a cm device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 4/23/2021 10:24 PM, Jason Gunthorpe wrote:
On Fri, Apr 23, 2021 at 09:14:21PM +0800, Mark Zhang wrote:


On 4/23/2021 3:34 AM, Jason Gunthorpe wrote:
On Wed, Apr 21, 2021 at 02:40:37PM +0300, Leon Romanovsky wrote:
@@ -4396,6 +4439,14 @@ static void cm_remove_one(struct ib_device *ib_device, void *client_data)
   	cm_dev->going_down = 1;
   	spin_unlock_irq(&cm.lock);
+	list_for_each_entry_safe(cm_id_priv, tmp,
+				 &cm_dev->cm_id_priv_list, cm_dev_list) {
+		if (!list_empty(&cm_id_priv->cm_dev_list))
+			list_del(&cm_id_priv->cm_dev_list);
+		cm_id_priv->av.port = NULL;
+		cm_id_priv->alt_av.port = NULL;
+	}

Ugh, this is in the wrong order, it has to be after the work queue
flush..

Hurm, I didn't see an easy way to fix it up, but I did think of a much
better design!

Generally speaking all we need is the memory of the cm_dev and port to
remain active, we don't need to block or fence with cm_remove_one(),
so just stick a memory kref on this thing and keep the memory. The
only things that needs to seralize with cm_remove_one() are on the
workqueue or take a spinlock (eg because they touch mad_agent)

Try this, I didn't finish every detail, applies on top of your series,
but you'll need to reflow it into new commits:

Thanks Jason, I think we still need a rwlock to protect "av->port"? It is
modified and cleared by cm_set_av_port() and read in many places.

Hum..

This is a real mess.

It looks to me like any access to the av->port should always be
protected by the cm_id_priv->lock

Most already are, but the sets are wrong and a couple readers are wrong

Set reverse call chains:

cm_init_av_for_lap()
  cm_lap_handler(work) (ok)

cm_init_av_for_response()
  cm_req_handler(work) (OK, cm_id_priv is on stack)
  cm_sidr_req_handler(work) (OK, cm_id_priv is on stack)

cm_init_av_by_path()
  cm_req_handler(work) (OK, cm_id_priv is on stack)
  cm_lap_handler(work) (OK)
  ib_send_cm_req() (not locked)
    cma_connect_ib()
     rdma_connect_locked()
      [..]
    ipoib_cm_send_req()
    srp_send_req()
      srp_connect_ch()
       [..]
  ib_send_cm_sidr_req() (not locked)
   cma_resolve_ib_udp()
    rdma_connect_locked()


Both cm_init_av_for_lap() and cm_init_av_by_path() might sleep (the last patch of this series tries to fix this issue), they cannot be called with cm_id_priv->lock


And
   cm_destroy_id() (locked)

And read reverse call chains:

cm_alloc_msg()
  ib_send_cm_req() (not locked)
  ib_send_cm_rep() (OK)
  ib_send_cm_rtu() (OK)
  cm_send_dreq_locked() (OK)
  cm_send_drep_locked() (OK)
  cm_send_rej_locked() (OK)
  ib_send_cm_mra() (OK)
  ib_send_cm_sidr_req() (not locked)
  cm_send_sidr_rep_locked() (OK)
cm_form_tid()
  cm_format_req()
   ib_send_cm_req() (sort of OK)
  cm_format_dreq()
    cm_send_dreq_locked (OK)
cm_format_req()
   ib_send_cm_req() (sort of OK)
cm_format_req_event()
  cm_req_handler() (OK, cm_id_priv is on stack)
cm_format_rep()
  ib_send_cm_rep() (OK)
cm_rep_handler(work) (OK)
cm_establish_handler(work) (OK)
cm_rtu_handler(work) (OK)
cm_send_dreq_locked() (OK)
cm_dreq_handler(work) (OK)
cm_drep_handler(work) (OK)
cm_rej_handler(work) (OK)
cm_mra_handler(work) (OK)
cm_apr_handler(work) (OK)
cm_sidr_rep_handler(work) (OK)
cm_init_qp_init_attr() (OK)
cm_init_qp_rtr_attr() (OK)
cm_init_qp_rts_attr() (OK)

So.. That leaves these functions that are not obviously locked
correctly:
  ib_send_cm_req()
  ib_send_cm_sidr_req()

And the way their locking expects to work is basically because they
expect that there are not parallel touches to the cm_id - however I'm
doubtful this is completely true.

So no new lock needed, but something should be done about the above
two functions, and this should be documented

Jason




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux