RE: [PATCH rdma-next 01/10] RDMA: Restore ability to fail on PD deallocate

"Saleem, Shiraz" <shiraz.saleem@xxxxxxxxx> · Thu, 27 Aug 2020 02:06:03 +0000

> Subject: Re: [PATCH rdma-next 01/10] RDMA: Restore ability to fail on PD
> deallocate
> 
> On Wed, Aug 26, 2020 at 12:49:03AM +0000, Saleem, Shiraz wrote:
> 
> > The API is quite confusing now. If drivers are not expected to fail
> > the destroy and there is no way to propagate the device failures, then
> > the return type should be a void.
> 
> More or less, drivers can only return -EAGAIN with the idea that a future call during
> the close process will eventually succeed.
> 
> Any permanent failure will trigger WARN_ON and a memory leak
> 
> Maybe we should switch the return code to bool or something to be a little clearer
> that it is request to retry, not a failure?
> 
There is no retry for kernel object destroy right? So not sure bool is making It clearer.

I am not very familiar with devx flows but doesn’t it bypass the ib verbs layer altogether?
i.e. mlx5_ib_dealloc_pd isn’t directly called in the devx flows no? So why changes its return
type and other provider destroy callbacks?

But lets go down the path that we really need a return code in the destroy APIs to solve this problem.
For one I don’t see how we can say its meant exclusively for devx drivers to use for a fail.
Also can we really claim the API contract is that driver can fail a destroy given a future destroy will succeed?
Since the kernel destroy has no retry.

Which then boils down do we just keep a simpler definition of the API contract -- driver can just return whatever the true error code is?
i.e. if it wants a retry, use EAGAIN. If it has a non recoverable device error, then reset the device, clean up the resources but return ENOTRECOVERABLE.
ib_core can enable the retry logic for EAGAIN _only_.  For other error codes, Ib_core can trigger a warn_on or something to indicate permanent failure.
It can also pass on ret_code to user-space as its doing today?

Shiraz