On Wed, Feb 23, 2022 at 05:07:07PM -0600, Bob Pearson wrote:
> +	/* this is the current number of qp's attached to mcg plus a
> +	 * little room in case new qp's are attached between here
> +	 * and when we finish walking the qp list. If someone can
> +	 * attach more than 4 new qp's we will miss forwarding
> +	 * packets to those qp's. This is actually OK since UD is
> +	 * a unreliable service.
> +	 */
> +	nmax = atomic_read(&mcg->qp_num) + 4;
> +	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);

Check for allocation failure?

> +	n = 0;
> 	spin_lock_bh(&rxe->mcg_lock);
> -
> -	/* this is unreliable datagram service so we let
> -	 * failures to deliver a multicast packet to a
> -	 * single QP happen and just move on and try
> -	 * the rest of them on the list
> -	 */
> 	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
> -		qp = mca->qp;
> +		/* protect the qp pointers in the list */
> +		rxe_add_ref(mca->qp);
> +		qp_array[n++] = mca->qp;
> +		if (n == nmax)
> +			break;
> +	}
> +	spin_unlock_bh(&rxe->mcg_lock);

So the issue here is that this needs to iterate over the list, but
doesn't want to hold a spinlock while it does the actions?

Perhaps the better pattern is to switch the qp list to an xarray (and
delete the mca entirely), then you can use xa_for_each here and avoid
the allocation.

The advantage of xarray is that iteration is safely restartable, so
the locks can be dropped:

   xa_lock()
   xa_for_each(...) {
        qp = mca->qp;
        rxe_add_ref(qp);
        xa_unlock();

        [.. stuff without lock ..]

        rxe_put_ref(qp)
        xa_lock();
   }

This would eliminate the rxe_mca entirely as the qp can be stored
directly in the xarray.

In most cases I suppose a mcg will have a small number of QPs, so this
should be much faster than the linked list, and especially faster than
the allocation.

And when I write it like the above you can see the RCU failure you
mentioned before is just a symptom of a larger bug, ie if RXE is doing
the above and replicating packets at the same instant that close
happens it will hit that WARN_ON as well, since the qp ref is
temporarily elevated.

The only way to fully fix that bug is to properly wait for all
transient users of the QP to be finished before allowing the QP to be
destroyed. But at least this version would make it not happen so
reliably and things can move on.

Jason
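
[Editor's note: for concreteness, a rough sketch of the xa_for_each
pattern described above, filled out with the bottom-half locking the
original code uses. This is not existing rxe code: the xarray field
qp_xa and the helper rxe_rcv_pkt_to_qp() are placeholders, and the
ref helpers follow the naming used in the sketch above.]

	/* hypothetical: 'struct xarray qp_xa' in struct rxe_mcg replaces
	 * qp_list and struct rxe_mca; qps are stored directly as entries
	 */
	struct rxe_qp *qp;
	unsigned long index;

	xa_lock_bh(&mcg->qp_xa);		/* mirrors spin_lock_bh() above */
	xa_for_each(&mcg->qp_xa, index, qp) {
		/* hold a ref so the qp can't be destroyed while unlocked */
		rxe_add_ref(qp);
		xa_unlock_bh(&mcg->qp_xa);

		/* deliver the packet to this qp without the lock held */
		rxe_rcv_pkt_to_qp(skb, qp);	/* placeholder for the per-qp work */

		rxe_put_ref(qp);
		/* retake the lock; the iteration resumes from 'index', so
		 * entries added or removed in the window are handled safely
		 */
		xa_lock_bh(&mcg->qp_xa);
	}
	xa_unlock_bh(&mcg->qp_xa);

Because the iterator's only state is the index, dropping the lock
inside the loop is safe; this is what makes the kmalloc_array()
snapshot (and its "+ 4" slack) unnecessary.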