Re: [PATCH for-next] RDMA/rxe: Fix potential race in rxe_pool_get_index

On 6/29/23 18:18, Jason Gunthorpe wrote:
> On Thu, Jun 29, 2023 at 05:30:24PM -0500, Bob Pearson wrote:
>> Currently the lookup of an object from its index and taking a
>> reference to the object are incorrectly protected by an rcu_read_lock
>> but this does not make the xa_load and the kref_get combination an
>> atomic operation.
>>
>> The various write operations need to share the xarray state in a
>> mixture of user, soft irq and hard irq contexts so the xa_locking
>> must support that.
>>
>> This patch replaces the xa locks with xa_lock_irqsave.
>>
>> Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by xarrays")
>> Signed-off-by: Bob Pearson <rpearsonhpe@xxxxxxxxx>
>> ---
>>  drivers/infiniband/sw/rxe/rxe_pool.c | 24 ++++++++++++++++++------
>>  1 file changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index 6215c6de3a84..f2b586249793 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -119,8 +119,10 @@ void rxe_pool_cleanup(struct rxe_pool *pool)
>>  int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
>>  				bool sleepable)
>>  {
>> -	int err;
>> +	struct xarray *xa = &pool->xa;
>> +	unsigned long flags;
>>  	gfp_t gfp_flags;
>> +	int err;
>>  
>>  	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
>>  		goto err_cnt;
>> @@ -138,8 +140,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
>>  
>>  	if (sleepable)
>>  		might_sleep();
>> -	err = xa_alloc_cyclic(&pool->xa, &elem->index, NULL, pool->limit,
>> +	xa_lock_irqsave(xa, flags);
>> +	err = __xa_alloc_cyclic(xa, &elem->index, NULL, pool->limit,
>>  			      &pool->next, gfp_flags);
>> +	xa_unlock_irqrestore(xa, flags);
> 
> This stuff doesn't make any sense, the non __ versions already take
> the xa_lock internally.
> 
> Or is this because you need the save/restore version for some reason?
> But that seems unrelated and there should be a lockdep oops to go
> along with it showing the backtrace??

The background here is that we are testing a 256-node system with the Lustre file system and doing
fail-over/fail-back testing, which is very high stress. This has uncovered several bugs, of which this
is just one.
The logic is: first we need to lock the lookup in rxe_pool_get_index(); then, when we tried to run
with ordinary spin_locks, we got lots of deadlocks. These are related to taking spin locks while in
(soft irq) interrupt context. In theory we could also get called in hard irq context, so we might as
well convert the locks to spin_lock_irqsave(), which is safe in all cases.
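(For the record, the deadlock mode being described is the usual same-CPU lock recursion; a
hedged sketch, with illustrative call paths rather than the actual rxe ones:)

```c
/* Kernel-style sketch, not actual rxe code paths.
 *
 * Process context:                 Soft irq on the same CPU:
 *   spin_lock(&xa->xa_lock);
 *   ...
 *   <softirq preempts here> --->     completion handler runs and
 *                                    calls spin_lock(&xa->xa_lock);
 *                                    it spins forever, because the
 *                                    holder is the task it just
 *                                    interrupted on this very CPU.
 *
 * spin_lock_irqsave(&xa->xa_lock, flags) disables local interrupts
 * for the duration of the critical section, so no soft/hard irq
 * handler can run on this CPU until the lock is dropped; that makes
 * it safe to take the lock from any context.
 */
```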
> 
>> @@ -154,15 +158,16 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
>>  {
>>  	struct rxe_pool_elem *elem;
>>  	struct xarray *xa = &pool->xa;
>> +	unsigned long flags;
>>  	void *obj;
>>  
>> -	rcu_read_lock();
>> +	xa_lock_irqsave(xa, flags);
>>  	elem = xa_load(xa, index);
>>  	if (elem && kref_get_unless_zero(&elem->ref_cnt))
>>  		obj = elem->obj;
>>  	else
>>  		obj = NULL;
>> -	rcu_read_unlock();
>> +	xa_unlock_irqrestore(xa, flags);
> 
> And this should be safe as long as the object is freed via RCU, so
> what are you trying to fix?

The problem here is that rcu_read_lock() only helps us if the object is freed with kfree_rcu().
But we have no control over what rdma-core does, and it does *not* do that for e.g. qp's.
If this code gets scheduled off the cpu between the xa_load() and the kref_get(), the object can
be completely cleaned up and freed by rdma-core. It isn't likely, but it definitely could happen.
This results in a seg fault in kref_get(). This is safer and keeps us out of Leon's hair.

Bob
> 
> Jason



