Hi Parav,

I ran your new mlx4 code together with a changed SMC-R code that no longer maps the IB_SEND_INLINE area. It worked - great!
Below I have added a small improvement idea to your patch, and at the end of this mail I have sketched the max_inline_data check and a possible SMC-R fallback.
Nevertheless I am still not sure whether I should keep the IB_SEND_INLINE flag in the SMC-R code, since there is no guarantee that it will work with all kinds of RoCE devices.
The maximum length for IB_SEND_INLINE depends on the RoCE driver - right? Is there an interface to determine such a maximum length? Would ib_create_qp() return an error if the .cap.max_inline_data = 44 specified by SMC-R is not supported by a RoCE driver?

On 03/14/2017 04:24 PM, Parav Pandit wrote:
> Hi Ursula,
>
>
>> -----Original Message-----
>> From: Ursula Braun [mailto:ubraun@xxxxxxxxxxxxxxxxxx]
>> Sent: Tuesday, March 14, 2017 10:02 AM
>> To: Parav Pandit <parav@xxxxxxxxxxxx>; Eli Cohen <eli@xxxxxxxxxxxx>; Matan Barak <matanb@xxxxxxxxxxxx>
>> Cc: Saeed Mahameed <saeedm@xxxxxxxxxxxx>; Leon Romanovsky <leonro@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
>> Subject: Re: Fwd: mlx5_ib_post_send panic on s390x
>>
>> Hi Parav,
>>
>> I tried your mlx4 patch together with SMC on s390x, but it failed.
>> The SMC-R code tries to send 44 bytes as inline in 1 sge.
>> I wonder about a length check with 16 bytes, which probably explains the failure.
>> See my question below in the patch:
>>
>> On 03/12/2017 09:20 PM, Parav Pandit wrote:
>>> Hi Ursula,
>>>
>>>> -----Original Message-----
>>>> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-owner@xxxxxxxxxxxxxxx] On Behalf Of Ursula Braun
>>>> Sent: Thursday, March 9, 2017 3:54 AM
>>>> To: Eli Cohen <eli@xxxxxxxxxxxx>; Matan Barak <matanb@xxxxxxxxxxxx>
>>>> Cc: Saeed Mahameed <saeedm@xxxxxxxxxxxx>; Leon Romanovsky <leonro@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
>>>> Subject: Re: Fwd: mlx5_ib_post_send panic on s390x
>>>>
>>>>
>>>>
>>>> On 03/06/2017 02:08 PM, Eli Cohen wrote:
>>>>>>>
>>>>>>> The problem seems to be caused by the usage of plain memcpy in set_data_inl_seg().
>>>>>>> The address provided by the SMC code in struct ib_send_wr *wr is an address belonging to an area mapped with the ib_dma_map_single() call. On s390x those kinds of addresses require extra access functions (see arch/s390/include/asm/io.h).
>>>>>>>
>>>>>
>>>>> By definition, when you are posting a send request with inline, the address must be mapped to the cpu, so a plain memcpy should work.
>>>>>
>>>> In the past I ran SMC-R with Connect X3 cards. The mlx4 driver does not seem to contain extra coding for the IB_SEND_INLINE flag in ib_post_send. Does this mean that for SMC-R to run on Connect X3 cards the IB_SEND_INLINE flag is ignored, and thus I needed the ib_dma_map_single() call for the area used with ib_post_send()? Does this mean I should stay away from the IB_SEND_INLINE flag if I want to run the same SMC-R code with both Connect X3 and Connect X4 cards?
>>>>
>>> I had encountered the same kernel panic that you mentioned last week on ConnectX-4 adapters with smc-r on x86_64.
>>> Shall I submit the fix below to the netdev mailing list?
>>> I have tested the above change. I also have an optimization that avoids the dma mapping for wr_tx_dma_addr.
>>>
>>> -		lnk->wr_tx_sges[i].addr =
>>> -			lnk->wr_tx_dma_addr + i * SMC_WR_BUF_SIZE;
>>> +		lnk->wr_tx_sges[i].addr = (uintptr_t)(lnk->wr_tx_bufs
>>> +						      + i);
>>>
>>> I also have a fix for processing IB_SEND_INLINE in the mlx4 driver, on a slightly older kernel base.
>>> I have attached it below.
>>> I can rebase my kernel and provide a fix in the mlx5_ib driver.
>>> Let me know.
>>>
>>> Regards,
>>> Parav Pandit
>>>
>>> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
>>> index a2e4ca5..0d984f5 100644
>>> --- a/drivers/infiniband/hw/mlx4/qp.c
>>> +++ b/drivers/infiniband/hw/mlx4/qp.c
>>> @@ -2748,6 +2748,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>>>  	unsigned long flags;
>>>  	int nreq;
>>>  	int err = 0;
>>> +	int inl = 0;
>>>  	unsigned ind;
>>>  	int uninitialized_var(stamp);
>>>  	int uninitialized_var(size);
>>> @@ -2958,30 +2959,97 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>>>  		default:
>>>  			break;
>>>  		}
>>> +		if (wr->send_flags & IB_SEND_INLINE && wr->num_sge) {
>>> +			struct mlx4_wqe_inline_seg *seg;
>>> +			void *addr;
>>> +			int len, seg_len;
>>> +			int num_seg;
>>> +			int off, to_copy;
>>> +
>>> +			inl = 0;
>>> +
>>> +			seg = wqe;
>>> +			wqe += sizeof *seg;
>>> +			off = ((uintptr_t) wqe) & (MLX4_INLINE_ALIGN - 1);
>>> +			num_seg = 0;
>>> +			seg_len = 0;
>>> +
>>> +			for (i = 0; i < wr->num_sge; ++i) {
>>> +				addr = (void *) (uintptr_t) wr->sg_list[i].addr;
>>> +				len = wr->sg_list[i].length;
>>> +				inl += len;
>>> +
>>> +				if (inl > 16) {
>>> +					inl = 0;
>>> +					err = ENOMEM;
>>> +					*bad_wr = wr;
>>> +					goto out;
>>> +				}
>>
>> SMC-R fails due to this check. inl is 44 here. Why is 16 a limit for IB_SEND_INLINE data?
>> The SMC-R code calls ib_create_qp() with max_inline_data=44, and the function does not seem to return an error.
>>>
> This check should be against the max_inline_data variable of the QP.
> This was just an error check; I should have fixed it. I was testing with nvme, where the inline data was only worth 16 bytes.
> I will fix this. Is it possible to change it to 44 and do a quick test?
> The final patch will have the right check, in addition to the check in create_qp.
>
>>> -		/*
>>> -		 * Write data segments in reverse order, so as to
>>> -		 * overwrite cacheline stamp last within each
>>> -		 * cacheline.  This avoids issues with WQE
>>> -		 * prefetching.
>>> -		 */
>>> +				while (len >= MLX4_INLINE_ALIGN - off) {

With this code there are two memcpy calls, one with to_copy=44 and the next one with len=0. I suggest changing the check to "len > MLX4_INLINE_ALIGN - off" (see the small diff at the end of this mail).

>>> +					to_copy = MLX4_INLINE_ALIGN - off;
>>> +					memcpy(wqe, addr, to_copy);
>>> +					len -= to_copy;
>>> +					wqe += to_copy;
>>> +					addr += to_copy;
>>> +					seg_len += to_copy;
>>> +					wmb(); /* see comment below */
>>> +					seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
>>> +					seg_len = 0;
>>> +					seg = wqe;
>>> +					wqe += sizeof *seg;
>>> +					off = sizeof *seg;
>>> +					++num_seg;
>>> +				}
>>>
>>> -		dseg = wqe;
>>> -		dseg += wr->num_sge - 1;
>>> -		size += wr->num_sge * (sizeof (struct mlx4_wqe_data_seg) / 16);
>>> +				memcpy(wqe, addr, len);
>>> +				wqe += len;
>>> +				seg_len += len;
>>> +				off += len;
>>> +			}
>>>
>>> -		/* Add one more inline data segment for ICRC for MLX sends */
>>> -		if (unlikely(qp->mlx4_ib_qp_type == MLX4_IB_QPT_SMI ||
>>> -			     qp->mlx4_ib_qp_type == MLX4_IB_QPT_GSI ||
>>> -			     qp->mlx4_ib_qp_type &
>>> -			     (MLX4_IB_QPT_PROXY_SMI_OWNER | MLX4_IB_QPT_TUN_SMI_OWNER))) {
>>> -			set_mlx_icrc_seg(dseg + 1);
>>> -			size += sizeof (struct mlx4_wqe_data_seg) / 16;
>>> -		}
>>> +			if (seg_len) {
>>> +				++num_seg;
>>> +				/*
>>> +				 * Need a barrier here to make sure
>>> +				 * all the data is visible before the
>>> +				 * byte_count field is set.
>>> +				 * Otherwise the HCA prefetcher could grab the
>>> +				 * 64-byte chunk with this inline
>>> +				 * segment and get a valid (!=
>>> +				 * 0xffffffff) byte count but stale
>>> +				 * data, and end up sending the wrong
>>> +				 * data.
>>> +				 */
>>> +				wmb();
>>> +				seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
>>> +			}
>>>
>>> -		for (i = wr->num_sge - 1; i >= 0; --i, --dseg)
>>> -			set_data_seg(dseg, wr->sg_list + i);
>>> +			size += (inl + num_seg * sizeof (*seg) + 15) / 16;
>>> +		} else {
>>> +			/*
>>> +			 * Write data segments in reverse order, so as to
>>> +			 * overwrite cacheline stamp last within each
>>> +			 * cacheline.  This avoids issues with WQE
>>> +			 * prefetching.
>>> +			 */
>>> +
>>> +			dseg = wqe;
>>> +			dseg += wr->num_sge - 1;
>>> +			size += wr->num_sge * (sizeof (struct mlx4_wqe_data_seg) / 16);
>>> +
>>> +			/* Add one more inline data segment for ICRC for MLX sends */
>>> +			if (unlikely(qp->mlx4_ib_qp_type == MLX4_IB_QPT_SMI ||
>>> +				     qp->mlx4_ib_qp_type == MLX4_IB_QPT_GSI ||
>>> +				     qp->mlx4_ib_qp_type &
>>> +				     (MLX4_IB_QPT_PROXY_SMI_OWNER | MLX4_IB_QPT_TUN_SMI_OWNER))) {
>>> +				set_mlx_icrc_seg(dseg + 1);
>>> +				size += sizeof (struct mlx4_wqe_data_seg) / 16;
>>> +			}
>>>
>>> +			for (i = wr->num_sge - 1; i >= 0; --i, --dseg)
>>> +				set_data_seg(dseg, wr->sg_list + i);
>>> +		}
>>>  		/*
>>>  		 * Possibly overwrite stamping in cacheline with LSO
>>>  		 * segment only after making sure all data segments
>>>
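To make the loop-boundary suggestion above concrete, this is the one-line change I mean on top of your patch (untested, just to show the intent):

-				while (len >= MLX4_INLINE_ALIGN - off) {
+				while (len > MLX4_INLINE_ALIGN - off) {

As far as I can tell, with this change a payload that exactly fills the rest of the current 64-byte chunk is handled by the single memcpy after the loop, so the second memcpy with len=0 and the extra, never-completed inline segment header go away.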
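For the hard-coded 16-byte check, I would expect the final patch to compare against the inline size the QP was created with. Just as a sketch of what I have in mind - qp->max_inline_data is an assumed field here, the current mlx4_ib_qp does not have it, so create_qp would need to remember init_attr->cap.max_inline_data somewhere:

				/* Sketch only: reject inline payloads larger than
				 * what the QP was created with, instead of the
				 * hard-coded 16 bytes.  qp->max_inline_data is an
				 * assumed field, to be filled in by create_qp.
				 */
				if (inl > qp->max_inline_data) {
					inl = 0;
					err = -EINVAL;	/* negative errno, as elsewhere in mlx4_ib_post_send */
					*bad_wr = wr;
					goto out;
				}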
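Independent of the mlx4 patch, here is roughly how the SMC-R side could stay defensive about IB_SEND_INLINE, since as far as I can see struct ib_device_attr does not advertise a maximum inline size. This is only a sketch, not the final SMC-R change: the helper name, the WR/SGE numbers, the use_inline flag and the SMC_WR_TX_SIZE define are made up here for illustration.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

#define SMC_WR_TX_SIZE 44	/* assumed: size of one SMC-R send WR payload */

/* Try to create the link QP with inline support; fall back to a plain
 * send path (no IB_SEND_INLINE, DMA-mapped buffers) if the provider
 * rejects the requested inline size.
 */
static struct ib_qp *smc_create_link_qp(struct ib_pd *pd, struct ib_cq *cq,
					bool *use_inline)
{
	struct ib_qp_init_attr init_attr = {
		.qp_type	= IB_QPT_RC,
		.send_cq	= cq,
		.recv_cq	= cq,
		.sq_sig_type	= IB_SIGNAL_REQ_WR,
		.cap = {
			.max_send_wr	 = 16,
			.max_recv_wr	 = 16,
			.max_send_sge	 = 1,
			.max_recv_sge	 = 1,
			.max_inline_data = SMC_WR_TX_SIZE,
		},
	};
	struct ib_qp *qp;

	qp = ib_create_qp(pd, &init_attr);
	if (!IS_ERR(qp)) {
		*use_inline = true;	/* provider accepted 44 bytes inline */
		return qp;
	}

	/* Retry without inline data. */
	init_attr.cap.max_inline_data = 0;
	*use_inline = false;
	return ib_create_qp(pd, &init_attr);
}

If something like this is acceptable, the send path would only set IB_SEND_INLINE when use_inline is true and would otherwise keep the ib_dma_map_single()-based path - but it relies on every provider actually failing ib_create_qp() when it cannot honor max_inline_data, which is exactly the question above.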