Re: Are device drivers ready for max_send_sge ?


> On Aug 29, 2018, at 1:50 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> Hey Chuck,
> 
>> -----Original Message-----
>> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-
>> owner@xxxxxxxxxxxxxxx> On Behalf Of Chuck Lever
>> Sent: Wednesday, August 29, 2018 11:49 AM
>> To: linux-rdma <linux-rdma@xxxxxxxxxxxxxxx>
>> Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Steve Wise
>> <swise@xxxxxxxxxxxxxxxxxxxxx>
>> Subject: Are device drivers ready for max_send_sge ?
>> 
>> In v4.19-rc1, the NFS/RDMA server has stopped working
>> for me.
>> 
>> The reason for this is that the mlx4 driver (with CX-3)
>> reports 62 in
>> 
>>  ib_device_attr::max_send_sge
>> 
>> But when the NFS server tries to create a QP with a
>> qp_attr.cap.max_send_sge set to 62, rdma_create_qp
>> fails with -EINVAL. The check that fails is in
>> 
>>  drivers/infiniband/hw/mlx4/qp.c :: set_kernel_sq_size
>> 
>> It's comparing the passed-in max_send_sge against the min()
>> of dev->dev->caps.max_sq_sg and dev->dev->caps.max_rq_sg,
>> and obviously that fails because max_rq_sg is smaller than
>> max_sq_sg.
>> 
>> set_rq_size() also has similar dependencies on max_sq_sg
>> that may no longer be appropriate.
>> 
> 
> These need fixing for sure.
> 
>> When I fix the first sanity check in set_kernel_sq_size
>> to ignore max_rq_sg, the third check in set_kernel_sq_size
>> fails. This is:
>> 
>>  s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg),
>>          cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) +
>>          send_wqe_overhead(type, qp->flags);
>> 
>>  if (s > dev->dev->caps.max_sq_desc_sz)
>> 
>> I don't know enough about this logic to suggest a fix.
>> 
> 
> If you set ib_device_attr::max_send_sge to 61, does it then work?  Just
> wondering if this is an off-by-one error.

I made this change to reduce ib_device_attr::max_send_sge
to 61:

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index babcfd9..81874f6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -950,7 +950,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
        }
 
        MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_SG_SQ_OFFSET);
-       dev_cap->max_sq_sg = field;
+       dev_cap->max_sq_sg = field - 1;
        MLX4_GET(size, outbox, QUERY_DEV_CAP_MAX_DESC_SZ_SQ_OFFSET);
        dev_cap->max_sq_desc_sz = size;


I added printk's to set_kernel_sq_size to capture the parameters
of the failing sanity check.

Here's with max_send_sge == 62:

Aug 29 12:26:01 klimt kernel: <mlx4_ib> set_kernel_sq_size: s = 1056
Aug 29 12:26:01 klimt kernel: <mlx4_ib> set_kernel_sq_size: dev->dev->caps.max_sq_desc_sz = 1008

and with max_send_sge == 61:

Aug 29 13:59:40 klimt kernel: <mlx4_ib> set_kernel_sq_size: s = 1040
Aug 29 13:59:40 klimt kernel: <mlx4_ib> set_kernel_sq_size: dev->dev->caps.max_sq_desc_sz = 1008


>> Is there a driver-level fix in the works, or should I
>> consider changing the NFS server to compute a smaller
>> qp_attr.cap.max_send_sge ?
>> 
>> --
>> Chuck Lever

--
Chuck Lever
