RE: Fwd: mlx5_ib_post_send panic on s390x

Hi Ursula,

Regarding the suggestion: it still needs to check for len >= MLX4_INLINE_ALIGN - off, because 44 = 64 - 20 is still a valid case (len == MLX4_INLINE_ALIGN - off).
But I agree that it shouldn't do a 2nd memcpy with zero length.
Therefore there should be an additional check for len != 0.
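In other words, roughly like this (only a sketch of the reworked hunk, not a tested patch):

	while (len >= MLX4_INLINE_ALIGN - off) {
		to_copy = MLX4_INLINE_ALIGN - off;
		memcpy(wqe, addr, to_copy);
		len -= to_copy;
		wqe += to_copy;
		addr += to_copy;
		seg_len += to_copy;
		wmb(); /* make the data visible before byte_count is set */
		seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
		seg_len = 0;
		seg = wqe;
		wqe += sizeof *seg;
		off = sizeof *seg;
		++num_seg;
	}
	if (len) { /* skip the zero-length 2nd memcpy */
		memcpy(wqe, addr, len);
		wqe += len;
		seg_len += len;
		off += len;
	}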

Coming to the IB_SEND_INLINE part: when ib_create_qp() is called and the HCA doesn't support the requested cap.max_inline_data, the provider HCA driver is supposed to fail the call.
The ULP is then expected to fall back to a non-inline scheme.
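For example (just a sketch; qp_attr, pd and use_inline are illustrative names, not the actual SMC-R variables):

	qp_attr.cap.max_inline_data = 44;
	qp = ib_create_qp(pd, &qp_attr);
	if (IS_ERR(qp)) {
		/* HCA cannot inline 44 bytes: fall back to the non-inline
		 * scheme and post from the dma-mapped send buffers. */
		use_inline = false;
		qp_attr.cap.max_inline_data = 0;
		qp = ib_create_qp(pd, &qp_attr);
	}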

As it appears, the mlx4 driver is not failing this call, which is a bug that needs a fix.
Instead of failing the call, though, I prefer to provide the inline data path sooner, based on my patch in this email thread.

Parav

> -----Original Message-----
> From: Ursula Braun [mailto:ubraun@xxxxxxxxxxxxxxxxxx]
> Sent: Thursday, March 16, 2017 6:51 AM
> To: Parav Pandit <parav@xxxxxxxxxxxx>; Eli Cohen <eli@xxxxxxxxxxxx>;
> Matan Barak <matanb@xxxxxxxxxxxx>
> Cc: Saeed Mahameed <saeedm@xxxxxxxxxxxx>; Leon Romanovsky
> <leonro@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: Fwd: mlx5_ib_post_send panic on s390x
> 
> Hi Parav,
> 
> I ran your new mlx4 code together with the changed SMC-R code that no
> longer maps the IB_SEND_INLINE area. It worked - great!
> 
> Below I have added a small improvement idea in your patch.
> 
> Nevertheless I am still not sure whether I should keep the IB_SEND_INLINE
> flag in the SMC-R code, since there is no guarantee that this will work
> with all kinds of RoCE devices. The maximum length for IB_SEND_INLINE
> depends on the RoCE driver - right? Is there an interface to determine
> such a maximum length? Would ib_create_qp() return with an error if the
> SMC-R-specified .cap.max_inline_data = 44 is not supported by a RoCE
> driver?
> 
> On 03/14/2017 04:24 PM, Parav Pandit wrote:
> > Hi Ursula,
> >
> >
> >> -----Original Message-----
> >> From: Ursula Braun [mailto:ubraun@xxxxxxxxxxxxxxxxxx]
> >> Sent: Tuesday, March 14, 2017 10:02 AM
> >> To: Parav Pandit <parav@xxxxxxxxxxxx>; Eli Cohen <eli@xxxxxxxxxxxx>;
> >> Matan Barak <matanb@xxxxxxxxxxxx>
> >> Cc: Saeed Mahameed <saeedm@xxxxxxxxxxxx>; Leon Romanovsky
> >> <leonro@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> >> Subject: Re: Fwd: mlx5_ib_post_send panic on s390x
> >>
> >> Hi Parav,
> >>
> >> I tried your mlx4 patch together with SMC on s390x, but it failed.
> >> The SMC-R code tries to send 44 bytes as inline in 1 sge.
> >> I wonder about a length check against 16 bytes, which probably explains
> >> the failure.
> >> See my question below in the patch:
> >>
> >> On 03/12/2017 09:20 PM, Parav Pandit wrote:
> >>> Hi Ursula,
> >>>
> >>>> -----Original Message-----
> >>>> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-
> >>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Ursula Braun
> >>>> Sent: Thursday, March 9, 2017 3:54 AM
> >>>> To: Eli Cohen <eli@xxxxxxxxxxxx>; Matan Barak
> <matanb@xxxxxxxxxxxx>
> >>>> Cc: Saeed Mahameed <saeedm@xxxxxxxxxxxx>; Leon Romanovsky
> >>>> <leonro@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> >>>> Subject: Re: Fwd: mlx5_ib_post_send panic on s390x
> >>>>
> >>>>
> >>>>
> >>>> On 03/06/2017 02:08 PM, Eli Cohen wrote:
> >>>>>>>
> >>>>>>> The problem seems to be caused by the usage of plain memcpy in
> >>>>>>> set_data_inl_seg().
> >>>>>>> The address provided by the SMC code in struct ib_send_wr *wr is an
> >>>>>>> address belonging to an area mapped with the ib_dma_map_single()
> >>>>>>> call. On s390x those kinds of addresses require extra access
> >>>>>>> functions (see arch/s390/include/asm/io.h).
> >>>>>>>
> >>>>>
> >>>>> By definition, when you are posting a send request with inline, the
> >>>>> address must be mapped to the CPU, so plain memcpy should work.
> >>>>>
> >>>> In the past I ran SMC-R with ConnectX-3 cards. The mlx4 driver does
> >>>> not seem to contain extra coding for the IB_SEND_INLINE flag in
> >>>> ib_post_send. Does this mean that for SMC-R to run on ConnectX-3
> >>>> cards the IB_SEND_INLINE flag is ignored, and thus I needed the
> >>>> ib_dma_map_single() call for the area used with ib_post_send()?
> >>>> Does this mean I should stay away from the IB_SEND_INLINE flag if I
> >>>> want to run the same SMC-R code with both ConnectX-3 and ConnectX-4
> >>>> cards?
> >>>>
> >>> I encountered the same kernel panic that you mentioned last week on
> >>> ConnectX-4 adapters with SMC-R on x86_64.
> >>> Shall I submit the fix below to the netdev mailing list?
> >>> I have tested the above change. I also have an optimization that avoids
> >>> the dma mapping for wr_tx_dma_addr.
> >>>
> >>> -               lnk->wr_tx_sges[i].addr =
> >>> -                       lnk->wr_tx_dma_addr + i * SMC_WR_BUF_SIZE;
> >>> +               lnk->wr_tx_sges[i].addr =
> >>> +                       (uintptr_t)(lnk->wr_tx_bufs + i);
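> >>> With IB_SEND_INLINE the provider copies the payload with the CPU, so
> >>> the sge can carry the virtual address of the send buffer and the dma
> >>> mapping is no longer needed on the send path.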
> >>>
> >>> I also have a fix for processing IB_SEND_INLINE in the mlx4 driver,
> >>> based on a slightly older kernel.
> >>> I have attached it below. I can rebase my kernel and provide a fix in
> >>> the mlx5_ib driver as well.
> >>> Let me know.
> >>>
> >>> Regards,
> >>> Parav Pandit
> >>>
> >>> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> >>> index a2e4ca5..0d984f5 100644
> >>> --- a/drivers/infiniband/hw/mlx4/qp.c
> >>> +++ b/drivers/infiniband/hw/mlx4/qp.c
> >>> @@ -2748,6 +2748,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> >>>  	unsigned long flags;
> >>>  	int nreq;
> >>>  	int err = 0;
> >>> +	int inl = 0;
> >>>  	unsigned ind;
> >>>  	int uninitialized_var(stamp);
> >>>  	int uninitialized_var(size);
> >>> @@ -2958,30 +2959,97 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> >>>  		default:
> >>>  			break;
> >>>  		}
> >>> +		if (wr->send_flags & IB_SEND_INLINE && wr->num_sge) {
> >>> +			struct mlx4_wqe_inline_seg *seg;
> >>> +			void *addr;
> >>> +			int len, seg_len;
> >>> +			int num_seg;
> >>> +			int off, to_copy;
> >>> +
> >>> +			inl = 0;
> >>> +
> >>> +			seg = wqe;
> >>> +			wqe += sizeof *seg;
> >>> +			off = ((uintptr_t) wqe) & (MLX4_INLINE_ALIGN - 1);
> >>> +			num_seg = 0;
> >>> +			seg_len = 0;
> >>> +
> >>> +			for (i = 0; i < wr->num_sge; ++i) {
> >>> +				addr = (void *) (uintptr_t) wr->sg_list[i].addr;
> >>> +				len  = wr->sg_list[i].length;
> >>> +				inl += len;
> >>> +
> >>> +				if (inl > 16) {
> >>> +					inl = 0;
> >>> +					err = -ENOMEM;
> >>> +					*bad_wr = wr;
> >>> +					goto out;
> >>> +				}
> >> SMC-R fails due to this check. inl is 44 here. Why is 16 a limit for
> >> IB_SEND_INLINE data?
> >> The SMC-R code calls ib_create_qp() with max_inline_data=44. And the
> >> function does not seem to return an error.
> >>>
> > This check should be against the max_inline_data variable of the QP.
> > It was just an error check that I should have fixed; I was testing with
> > nvme, where the inline data was only worth 16 bytes.
> > I will fix this. Is it possible to change the 16 to 44 and do a quick test?
> > The final patch will have the right check here, in addition to the check
> > in create_qp.
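> > Roughly like this (a sketch only; it assumes the QP remembers the
> > create_qp value in a max_inline_data member, which this draft does not
> > carry yet):
> >
> >		if (inl > qp->max_inline_data) {
> >			inl = 0;
> >			err = -ENOMEM;
> >			*bad_wr = wr;
> >			goto out;
> >		}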
> >
> >>> -		/*
> >>> -		 * Write data segments in reverse order, so as to
> >>> -		 * overwrite cacheline stamp last within each
> >>> -		 * cacheline.  This avoids issues with WQE
> >>> -		 * prefetching.
> >>> -		 */
> >>> +				while (len >= MLX4_INLINE_ALIGN - off) {
> With this code there are 2 memcpy calls, one with to_copy=44 and the next
> one with len 0.
> I suggest changing the check to "len > MLX4_INLINE_ALIGN - off".
> >>> +					to_copy = MLX4_INLINE_ALIGN - off;
> >>> +					memcpy(wqe, addr, to_copy);
> >>> +					len -= to_copy;
> >>> +					wqe += to_copy;
> >>> +					addr += to_copy;
> >>> +					seg_len += to_copy;
> >>> +					wmb(); /* see comment below */
> >>> +					seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
> >>> +					seg_len = 0;
> >>> +					seg = wqe;
> >>> +					wqe += sizeof *seg;
> >>> +					off = sizeof *seg;
> >>> +					++num_seg;
> >>> +				}
> >>>
> >>> -		dseg = wqe;
> >>> -		dseg += wr->num_sge - 1;
> >>> -		size += wr->num_sge * (sizeof (struct mlx4_wqe_data_seg) / 16);
> >>> +				memcpy(wqe, addr, len);
> >>> +				wqe += len;
> >>> +				seg_len += len;
> >>> +				off += len;
> >>> +			}
> >>>
> >>> -		/* Add one more inline data segment for ICRC for MLX sends */
> >>> -		if (unlikely(qp->mlx4_ib_qp_type == MLX4_IB_QPT_SMI ||
> >>> -			     qp->mlx4_ib_qp_type == MLX4_IB_QPT_GSI ||
> >>> -			     qp->mlx4_ib_qp_type &
> >>> -			     (MLX4_IB_QPT_PROXY_SMI_OWNER | MLX4_IB_QPT_TUN_SMI_OWNER))) {
> >>> -			set_mlx_icrc_seg(dseg + 1);
> >>> -			size += sizeof (struct mlx4_wqe_data_seg) / 16;
> >>> -		}
> >>> +			if (seg_len) {
> >>> +				++num_seg;
> >>> +				/*
> >>> +				 * Need a barrier here to make sure
> >>> +				 * all the data is visible before the
> >>> +				 * byte_count field is set.  Otherwise
> >>> +				 * the HCA prefetcher could grab the
> >>> +				 * 64-byte chunk with this inline
> >>> +				 * segment and get a valid (!=
> >>> +				 * 0xffffffff) byte count but stale
> >>> +				 * data, and end up sending the wrong
> >>> +				 * data.
> >>> +				 */
> >>> +				wmb();
> >>> +				seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
> >>> +			}
> >>>
> >>> -		for (i = wr->num_sge - 1; i >= 0; --i, --dseg)
> >>> -			set_data_seg(dseg, wr->sg_list + i);
> >>> +			size += (inl + num_seg * sizeof (*seg) + 15) / 16;
> >>> +		} else {
> >>> +			/*
> >>> +			 * Write data segments in reverse order, so as to
> >>> +			 * overwrite cacheline stamp last within each
> >>> +			 * cacheline.  This avoids issues with WQE
> >>> +			 * prefetching.
> >>> +			 */
> >>> +
> >>> +			dseg = wqe;
> >>> +			dseg += wr->num_sge - 1;
> >>> +			size += wr->num_sge * (sizeof (struct mlx4_wqe_data_seg) / 16);
> >>> +
> >>> +			/* Add one more inline data segment for ICRC for MLX sends */
> >>> +			if (unlikely(qp->mlx4_ib_qp_type == MLX4_IB_QPT_SMI ||
> >>> +				     qp->mlx4_ib_qp_type == MLX4_IB_QPT_GSI ||
> >>> +				     qp->mlx4_ib_qp_type &
> >>> +				     (MLX4_IB_QPT_PROXY_SMI_OWNER | MLX4_IB_QPT_TUN_SMI_OWNER))) {
> >>> +				set_mlx_icrc_seg(dseg + 1);
> >>> +				size += sizeof (struct mlx4_wqe_data_seg) / 16;
> >>> +			}
> >>>
> >>> +			for (i = wr->num_sge - 1; i >= 0; --i, --dseg)
> >>> +				set_data_seg(dseg, wr->sg_list + i);
> >>> +		}
> >>>  		/*
> >>>  		 * Possibly overwrite stamping in cacheline with LSO
> >>>  		 * segment only after making sure all data segments
> >>>
