Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage

Erez Shitrit <erezsh@xxxxxxxxxxxxxxxxxx> · Tue, 03 Mar 2015 11:53:21 +0200

On 3/2/2015 5:27 PM, Doug Ledford wrote:
On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote:
On 2/22/2015 2:27 AM, Doug Ledford wrote:
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used.  We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it.  And when we did
complete it, sometimes we continued to touch the mcast entry after
the completion, opening us up to possible use after free issues.

This made it less than totally effective, and certainly made its use
confusing.  And in the flush function we would use the presence of this
flag to signal that we should wait on the completion struct, but we never
cleared this flag, ever.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

   1) Remove the MCAST_JOIN_STARTED flag entirely
   2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
   3) Test mcast->mc directly to see if we have completed
      ib_sa_join_multicast (using IS_ERR_OR_NULL)
   4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
      the mcast->done completion struct
   5) Make sure that before calling complete(&mcast->done), we always clear
      the MCAST_FLAG_BUSY bit
   6) Take the mcast_mutex before we call ib_sa_multicast_join and also
      take the mutex in our join callback.  This forces
      ib_sa_multicast_join to return and set mcast->mc before we process
      the callback.  This way, our callback can safely clear mcast->mc
      if there is an error on the join and we will do the right thing as
      a result in mcast_dev_flush.
   7) Because we need the mutex to synchronize mcast->mc, we can no
      longer call mcast_sendonly_join directly from mcast_send and
      instead must add sendonly join processing to the mcast_join_task
   8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
      we have a running task.  We know when we need to reschedule our
      join task thread and don't need a flag to tell us.
   9) Add a helper for rescheduling the join task thread

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                             Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
    alloc member
    call callback
                                       ifconfig ib0 down
				     wait_for_completion
      callback call completes
                                       wait_for_completion in
				     mcast_dev_flush completes
				       mcast->mc is PTR_ERR_OR_NULL
				       so we skip ib_sa_leave_multicast
      return from callback
    return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1                   Thread 2         Thread 3
ib_sa_multicast_join
                                              ifconfig ib0 down
					    priv->broadcast = NULL
                             join_complete
			                    wait_for_completion
			   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
			   complete
			   return -EAGAIN (making mcast->mc invalid)
			   		    call ib_sa_multicast_leave
					    on invalid mcast->mc, hang
					    forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback.  This allows us to clear mcast->mc if there is an
error and the join is going to fail.  We do this before we complete
the mcast.  In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
---
   drivers/infiniband/ulp/ipoib/ipoib.h           |  11 +-
   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 ++++++++++++++++---------
   2 files changed, 238 insertions(+), 128 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9ef432ae72e..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
   
   	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
   	IPOIB_MCAST_FLAG_SENDONLY = 1,
-	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
+	/*
+	 * For IPOIB_MCAST_FLAG_BUSY
+	 * When set, in flight join and mcast->mc is unreliable
+	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
+	 *   haven't started yet
+	 * When clear and mcast->mc is valid pointer, join was successful
+	 */
+	IPOIB_MCAST_FLAG_BUSY	  = 2,
   	IPOIB_MCAST_FLAG_ATTACHED = 3,
-	IPOIB_MCAST_JOIN_STARTED  = 4,
   
   	MAX_SEND_CQE		  = 16,
   	IPOIB_CM_COPYBREAK	  = 256,
@@ -148,6 +154,7 @@ struct ipoib_mcast {
   
   	unsigned long created;
   	unsigned long backoff;
+	unsigned long delay_until;
   
   	unsigned long flags;
   	unsigned char logcount;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bb1b69904f9..277e7ac7c4d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -66,6 +66,48 @@ struct ipoib_mcast_iter {
   	unsigned int       send_only;
   };
   
+/*
+ * This should be called with the mcast_mutex held
+ */
+static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
+					       struct ipoib_mcast *mcast,
+					       bool delay)
+{
+	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
You don't need the flag IPOIB_MCAST_RUN, it is duplicated of
IPOIB_FLAG_OPER_UP
probably, need to be taken from all places (including ipoib.h file).
This is probably true, however I skipped it for this series of patches.
It wasn't a requirement of proper operation, and depending on where in
the patch series you tried to inject this change, it had unintended
negative consequences.  In particular, up until patch 7/9,
mcast_restart_task used to do a hand rolled stop thread and a matching
start_thread and so couldn't use FLAG_OPER_UP because we needed
FLAG_OPER_UP to tell us whether or not to restart the thread.

OK, sounds reasonable. we can send a patch after that that fix that.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html