RE: Are device drivers ready for max_send_sge ?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> -----Original Message-----
> From: Chuck Lever <chuck.lever@xxxxxxxxxx>
> Sent: Wednesday, August 29, 2018 1:09 PM
> To: Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx>
> Cc: linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>; Leon Romanovsky
> <leon@xxxxxxxxxx>
> Subject: Re: Are device drivers ready for max_send_sge ?
> 
> 
> 
> > On Aug 29, 2018, at 1:50 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx>
> wrote:
> >
> > Hey Chuck,
> >
> >> -----Original Message-----
> >> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-
> >> owner@xxxxxxxxxxxxxxx> On Behalf Of Chuck Lever
> >> Sent: Wednesday, August 29, 2018 11:49 AM
> >> To: linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>
> >> Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Steve Wise
> >> <swise@xxxxxxxxxxxxxxxxxxxxx>
> >> Subject: Are device drivers ready for max_send_sge ?
> >>
> >> In v4.19-rc1, the NFS/RDMA server has stopped working
> >> for me.
> >>
> >> The reason for this is that the mlx4 driver (with CX-3)
> >> reports 62 in
> >>
> >>  ib_device_attr::max_send_sge
> >>
> >> But when the NFS server tries to create a QP with a
> >> qp_attr.cap.max_send_sge set to 62, rdma_create_qp
> >> fails with -EINVAL. The check that fails is in
> >>
> >>  drivers/infiniband/hw/mlx4/qp.c :: set_kernel_sq_size
> >>
> >> It's comparing the passed-in max_send_sge against the min()
> >> of dev->dev->caps.max_sq_sg and dev->dev->caps.max_rq_sg,
> >> and obviously that fails because max_rq_sg is smaller than
> >> max_sq_sg.
> >>
> >> set_rq_size() also has similar dependencies on max_sq_sg
> >> that may no longer be appropriate.
> >>
> >
> > These need fixing for sure.
> >
> >> When I fix the first sanity check in set_kernel_sq_size
> >> to ignore max_rq_sg, the third check in set_kernel_sq_size
> >> fails. This is:
> >>
> >>  s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg),
> >>          cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) +
> >>          send_wqe_overhead(type, qp->flags);
> >>
> >>  if (s > dev->dev->caps.max_sq_desc_sz)
> >>
> >> I don't know enough about this logic to suggest a fix.
> >>
> >
> > If you set ib_device_attr::max_send_sge to 61, does it then work?  Just
> > wonder if this is a 1-off error.
> 
> I made this change to reduce ib_device_attr::max_send_sge
> to 61:
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c
> b/drivers/net/ethernet/mellanox/mlx4/fw.c
> index babcfd9..81874f6 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/fw.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
> @@ -950,7 +950,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev,
> struct mlx4_dev_cap *dev_cap)
>         }
> 
>         MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SG_SQ_OFFSET);
> -       dev_cap->max_sq_sg = field;
> +       dev_cap->max_sq_sg = field - 1;
>         MLX4_GET(size, outbox, QUERY_DEV_CAP_MAX_DESC_SZ_SQ_OFFSET);
>         dev_cap->max_sq_desc_sz = size;
> 
> 
> I added printk's to set_kernel_sq_size to capture the parameters
> of the failing sanity check.
> 
> Here's with max_send_sge == 62:
> 
> Aug 29 12:26:01 klimt kernel: <mlx4_ib> set_kernel_sq_size: s = 1056
> Aug 29 12:26:01 klimt kernel: <mlx4_ib> set_kernel_sq_size: dev->dev-
> >caps.max_sq_desc_sz = 1008
> 
> and with max_send_sge == 61:
> 
> Aug 29 13:59:40 klimt kernel: <mlx4_ib> set_kernel_sq_size: s = 1040
> Aug 29 13:59:40 klimt kernel: <mlx4_ib> set_kernel_sq_size: dev->dev-
> >caps.max_sq_desc_sz = 1008
> 

I guess mlx4_ib_query_device() could be enhanced to limit the max:

ib_device_attr::max_send_sge = min(fw-advertised-max, (max_sq_desc_space -
max_wqe_overhead) / sge_descriptor_size);


Steve.




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux