Re: [PATCH rdma-next] RDMA/restrack: Delay QP deletion till all users are gone

Leon Romanovsky <leon@xxxxxxxxxx> · Sun, 25 Apr 2021 16:44:55 +0300

On Sun, Apr 25, 2021 at 10:08:57AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 25, 2021 at 04:03:47PM +0300, Leon Romanovsky wrote:
> > On Thu, Apr 22, 2021 at 11:29:39AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 21, 2021 at 08:03:22AM +0300, Leon Romanovsky wrote:
> > > 
> > > > I didn't understand it when I reviewed it either, but decided to post
> > > > it anyway to get a possible explanation for this
> > > > RDMA_RESTRACK_MR || RDMA_RESTRACK_QP check.
> > > 
> > > I think the whole thing should look more like this and we delete the
> > > if entirely.
> > 
> > I have mixed feelings about this approach. Before "destroy can fail disaster",
> > the restrack goal was to provide the following flow:
> > 1. create new memory object - rdma_restrack_new()
> > 2. create new HW object - .create_XXX() callback in the driver
> > 3. add HW object to the DB - rdma_restrack_add()
> > ....
> > 4. wait for any work on this HW object to complete - rdma_restrack_del()
> > 5. safely destroy HW object - .destroy_XXX()
> > 
> > I really would like to stay with this flow and block any access to the
> > object that failed to destruct - maybe add to some zombie list.
> 
> That isn't the semantic we now have for destroy.

I would say that it is my mistake, introduced when I changed destroy to
return an error.

>  
> > The proposed prepare/abort/finish flow is much harder to implement correctly.
> > Let's take ib_destroy_qp_user() as an example: we call rdma_rw_cleanup_mrs(),
> > but don't restore the MRs after a .destroy_qp() failure.
> 
> I think it is a bug that we call rdma_rw code in a user path.

It was an example of a flow that wasn't restored properly.
The same goes for ib_dealloc_pd_user() and its release of __internal_mr.

Of course, these flows shouldn't fail because they are kernel flows, but that
is not clear from the code.
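
To make the five-step flow quoted above concrete, here is a rough user-space
sketch in plain C. It is not the restrack code: struct toy_res and every
toy_*() helper are hypothetical names invented for illustration, and a real
implementation would sleep in step 4 until the last user drops its reference,
where this toy only reports whether the object is still busy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tracked resource; not the kernel's struct. */
struct toy_res {
	int users;	/* references held, including the creator's own */
	bool in_db;	/* visible to lookups (after step 3, before step 4) */
	bool hw_alive;	/* HW object exists (between steps 2 and 5) */
};

/* Step 1: create the memory object. */
static struct toy_res *toy_new(void)
{
	struct toy_res *r = calloc(1, sizeof(*r));

	if (r)
		r->users = 1;
	return r;
}

/* Step 2: stand-in for the driver's .create_XXX() callback. */
static int toy_hw_create(struct toy_res *r)
{
	r->hw_alive = true;
	return 0;
}

/* Step 3: publish the object so that lookups can find it. */
static void toy_add(struct toy_res *r) { r->in_db = true; }

/* A user takes a reference; this fails once the object is unpublished. */
static bool toy_get(struct toy_res *r)
{
	if (!r->in_db)
		return false;
	r->users++;
	return true;
}

static void toy_put(struct toy_res *r) { r->users--; }

/* Step 4, first half: unpublish so that no new user can appear. */
static void toy_del(struct toy_res *r) { r->in_db = false; }

/* Step 4, second half: true once only the creator's reference is left. */
static bool toy_idle(const struct toy_res *r) { return r->users == 1; }

/* Step 5: stand-in for .destroy_XXX(); refuses while users remain. */
static int toy_hw_destroy(struct toy_res *r)
{
	if (!toy_idle(r))
		return -1;
	r->hw_alive = false;
	return 0;
}

/* Walks steps 1-5 with one concurrent user; returns 0 if the order held. */
static int toy_demo(void)
{
	struct toy_res *r = toy_new();
	int rc = -1;

	if (!r)
		return rc;
	if (toy_hw_create(r))
		goto out;
	toy_add(r);
	if (!toy_get(r))		/* a user grabs the object */
		goto out;
	toy_del(r);
	if (toy_idle(r) || toy_get(r))	/* still busy, closed to new users */
		goto out;
	if (toy_hw_destroy(r) == 0)	/* destroy must not run while busy */
		goto out;
	toy_put(r);			/* the user finishes its work */
	if (!toy_idle(r))
		goto out;
	rc = toy_hw_destroy(r);		/* now it is safe */
	if (rc == 0 && r->hw_alive)
		rc = -1;
out:
	free(r);
	return rc;
}
```

In this toy, toy_demo() returns 0 only if the HW object is torn down after
the last user is gone. It deliberately does not model a destroy that fails
for other reasons, which is the case under discussion here.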

Thanks

> 
> Jason