On Sun, Aug 30, 2020 at 11:40:03AM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky <leonro@xxxxxxxxxxxx> > > The HW release can fail and leave the system in limbo state, > where SRQ is removed from the table, but can't be destroyed later. > In every reentry, the initial xa_erase_irq() check will fail. > > Rewrite the erase logic to keep index, but don't store the entry > itself. By doing it, we can safely reinsert entry back in the case > of destroy failure and be safe from any xa_store_irq() error. > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx> > drivers/infiniband/hw/mlx5/srq_cmd.c | 15 ++++++++++++--- > 1 file changed, 12 insertions(+), 3 deletions(-) > > diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c > index 37aaacebd3f2..c6d807f04d9d 100644 > +++ b/drivers/infiniband/hw/mlx5/srq_cmd.c > @@ -596,13 +596,22 @@ void mlx5_cmd_destroy_srq(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq) > struct mlx5_core_srq *tmp; > int err; > > - tmp = xa_erase_irq(&table->array, srq->srqn); > - if (!tmp || tmp != srq) > + /* Delete entry, but leave index occupied */ > + tmp = xa_store_irq(&table->array, srq->srqn, NULL, 0); > + if (WARN_ON(!tmp || tmp != srq)) > return; This isn't an allocating xarray: xa_init_flags(&table->array, XA_FLAGS_LOCK_IRQ); So storing NULL actually does delete the entry and clean up the memory and the store below could fail. I think this should be written as xa_cmpxchg_irq(&table->array, srq->srqn, srq, XA_ZERO_ENTRY, 0); And the undo below would be xa_cmpxchg_irq(&table->array, srq->srqn, XA_ZERO_ENTRY, srq 0); > + xa_erase_irq(&table->array, srq->srqn); And this is racy since the FW could have reallocated the same srqn and already set it in the xarray. It needs to be xa_release_irq(), which looks like it needs to be added to match xa_reserve_irq() Jason