Re: [PATCH rdma-next] IB/mlx5: Fix long EEH recover time with NVMe offloads

Jason Gunthorpe <jgg@xxxxxxxx> · Thu, 20 Dec 2018 20:44:58 -0700

On Tue, Dec 18, 2018 at 02:15:56PM +0200, Leon Romanovsky wrote:
> From: Huy Nguyen <huyn@xxxxxxxxxxxx>
> 
> On an NVMe offloads connection with many IO queues, EEH takes a long time
> to recover. The culprit is the synchronize_srcu in destroy_mkey. The
> solution is to call synchronize_srcu only for ODP mkeys.
> 
> Fixes: b4cfe447d47b ("IB/mlx5: Implement on demand paging by adding support for MMU notifiers")
> Signed-off-by: Huy Nguyen <huyn@xxxxxxxxxxxx>
> Reviewed-by: Daniel Jurgens <danielj@xxxxxxxxxxxx>
> Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> ---
>  drivers/infiniband/hw/mlx5/mr.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)

I'm going to apply this, because it does make sense to reduce the
calls to synchronize_srcu. However, I think this design is poor: it
would be better to use call_srcu to do the cleanup/kfree rather than
a full synchronize, as this problem will return if there are a large
number of user ODP MRs.
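
To illustrate the difference, here is a rough userspace model, not the
mlx5 code. Every toy_* name is hypothetical: toy_synchronize() stands in
for synchronize_srcu(), toy_call() for call_srcu(), and
toy_grace_period_end() for the point where SRCU runs queued callbacks.
It assumes all callbacks queued before a grace period ends are run
together, which simplifies how the kernel batches them.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Hypothetical stand-in for struct srcu_struct. */
struct toy_srcu {
	struct cb_head *pending;	/* callbacks queued by toy_call() */
	unsigned long grace_periods;	/* grace periods elapsed so far */
};

/* Models synchronize_srcu(): the caller waits out one grace period. */
static void toy_synchronize(struct toy_srcu *sp)
{
	sp->grace_periods++;
}

/* Models call_srcu(): queue the callback and return without waiting. */
static void toy_call(struct toy_srcu *sp, struct cb_head *head,
		     void (*func)(struct cb_head *head))
{
	head->func = func;
	head->next = sp->pending;
	sp->pending = head;
}

/* Models a grace period ending: run every callback queued so far. */
static void toy_grace_period_end(struct toy_srcu *sp)
{
	struct cb_head *head = sp->pending;

	sp->pending = NULL;
	sp->grace_periods++;
	while (head) {
		struct cb_head *next = head->next; /* func() frees head */

		head->func(head);
		head = next;
	}
}

/* Hypothetical MR object with the callback head embedded in it. */
struct toy_mr {
	int key;
	struct cb_head rcu;
};

static int freed_count;	/* MRs released from the deferred callback */

/* Deferred cleanup: recover the MR from its embedded head and free it. */
static void toy_mr_free_cb(struct cb_head *head)
{
	struct toy_mr *mr = (struct toy_mr *)((char *)head -
					      offsetof(struct toy_mr, rcu));

	free(mr);
	freed_count++;
}

/* Synchronous teardown: one grace-period wait per MR. */
static unsigned long destroy_sync(int n)
{
	struct toy_srcu sp = { 0 };

	for (int i = 0; i < n; i++) {
		struct toy_mr *mr = malloc(sizeof(*mr));

		if (!mr)
			abort();
		mr->key = i;
		toy_synchronize(&sp);
		free(mr);
	}
	return sp.grace_periods;
}

/* Deferred teardown: queue every free, then a single grace period. */
static unsigned long destroy_deferred(int n)
{
	struct toy_srcu sp = { 0 };

	for (int i = 0; i < n; i++) {
		struct toy_mr *mr = malloc(sizeof(*mr));

		if (!mr)
			abort();
		mr->key = i;
		toy_call(&sp, &mr->rcu, toy_mr_free_cb);
	}
	toy_grace_period_end(&sp);
	return sp.grace_periods;
}
```

In this model the synchronous path costs one grace period per MR, while
the deferred path costs one for the whole batch. In the kernel, a
call_srcu conversion would presumably also need an srcu_barrier before
the srcu_struct is cleaned up, so that pending callbacks have finished.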

So, I think a followup would be good.

Jason