Re: [PATCH v13 for-next 5/6] RDMA/rxe: For mcast copy qp list to temp array

Bob Pearson <rpearsonhpe@xxxxxxxxx> · Thu, 24 Feb 2022 11:29:36 -0600



On 2/23/22 18:26, Jason Gunthorpe wrote:
> On Wed, Feb 23, 2022 at 05:07:07PM -0600, Bob Pearson wrote:
> 
>> +	/* this is the current number of qp's attached to mcg plus a
>> +	 * little room in case new qp's are attached between here
>> +	 * and when we finish walking the qp list. If someone can
>> +	 * attach more than 4 new qp's we will miss forwarding
>> +	 * packets to those qp's. This is actually OK since UD is
>> +	 * a unreliable service.
>> +	 */
>> +	nmax = atomic_read(&mcg->qp_num) + 4;
>> +	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
> 
> Check for allocation failure?
> 
>> +	n = 0;
>>  	spin_lock_bh(&rxe->mcg_lock);
>> -
>> -	/* this is unreliable datagram service so we let
>> -	 * failures to deliver a multicast packet to a
>> -	 * single QP happen and just move on and try
>> -	 * the rest of them on the list
>> -	 */
>>  	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
>> -		qp = mca->qp;
>> +		/* protect the qp pointers in the list */
>> +		rxe_add_ref(mca->qp);
>> +		qp_array[n++] = mca->qp;
>> +		if (n == nmax)
>> +			break;
>> +	}
>> +	spin_unlock_bh(&rxe->mcg_lock);
> 
> So the issue here is this needs to iterate over the list, but doesn't
> want to hold a spinlock while it does the actions?
It's not walking the linked list I am worried about but the call to rxe_rcv_pkt()
which can last a long time as it has to copy the payload to user space and then
generate a work completion and possibly a completion event.
> 
> Perhaps the better pattern is switch the qp list to an xarray (and
> delete the mca entirely), then you can use xa_for_each here and avoid
> the allocation. The advantage of xarray is that iteration is
> safely restartable, so the locks can be dropped.
> 
> xa_lock()
> xa_for_each(...) {
>    qp = mca->qp;
>    rxe_add_ref(qp);
>    xa_unlock();
> 
>    [.. stuff without lock ..]
> 
>    rxe_put_ref(qp)
>    xa_lock();
> }
> 
> This would eliminate the rxe_mca entirely as the qp can be stored
> directly in the xarray.
> 
> In most cases I suppose a mcg will have a small number of QP so this
> should be much faster than the linked list, and especially than the
> allocation.
The way the spec is written seems to anticipate that there is a fixed
sized table of qp's for each mcast address. The way this is now written
is sort of a compromise where I am guessing that the table kmalloc is
smaller than the skb clones so just left the linked list around to
make deltas to the list easier and keep separate from the linear list.

An alternative would be to include the array in the mcg struct and
scan the array to attach/detach qp's while holding a lock. The scan
here in rxe_rcv_mcast_pkt doesn't need to be locked as long as the qp's
are stored and cleared atomically. Just check to see if they are
non-zero at the moment the packet arrives. 
> 
> And when I write it like the above you can see the RCU failure you
> mentioned before is just symptom of a larger bug, ie if RXE is doing
> the above and replicating packets at the same instant that close
> happens it will hit that WARN_ON as well since the qp ref is
> temporarily elevated.
> 
> The only way to fully fix that bug is to properly wait for all
> transient users of the QP to be finished before allowing the QP to be
> destroyed.
> 
> But at least this version would make it not happen so reliably and
> things can move on.
> 
> Jason