Re: cm_process_routed_req() does not resonate well with RoCE systems

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On 23 Mar 2021, at 19:08, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> On Fri, Mar 19, 2021 at 03:07:07PM +0000, Haakon Bugge wrote:
>> 
>> 
>>> On 18 Mar 2021, at 20:21, Haakon Bugge <haakon.bugge@xxxxxxxxxx> wrote:
>>> 
>>> With the introduction of RoCE systems, a CM REQ message will contain (pasted from Wireshark):
>>> 
>>> Primary Hop Limit: 0x40
>>> .... 0... = Primary Subnet Local: 0x0
>>> 
>>> This because cma_resolve_iboe_route() has:
>>> 
>>>       if (((struct sockaddr *)&id_priv->id.route.addr.dst_addr)->sa_family != AF_IB)
>>>               /* TODO: get the hoplimit from the inet/inet6 device */
>>>               route->path_rec->hop_limit = addr->dev_addr.hoplimit;
>>>       else
>>>               route->path_rec->hop_limit = 1;
>>> 
>>> The addr->dev_addr.hoplimit is coming from addr4_resolve(), which has:
>>> 
>>> 	addr->hoplimit = ip4_dst_hoplimit(&rt->dst);
>>> 
>>> ip4_dst_hoplimit() returns the value of the sysctl net.ipv4.ip_default_ttl in this case (64).
>>> 
>>> For the purpose of this case, consider the CM REQ to have the Primary SL != 0.
>>> 
>>> When this REQ gets processed by cm_req_handler(), the cm_process_routed_req() function is called.
>>> 
>>> Since the Primary Subnet Local value is zero in the request, and since this is RoCE (Primary Local LID is permissive), the following statement will be executed:
>>> 
>>> 	IBA_SET(CM_REQ_PRIMARY_SL, req_msg, wc->sl);
>>> 
>>> At least on the system I ran on, which was equipped with a
>>> Mellanox CX-5 HCA, the wc->sl is zero. Hence, the request to setup
>>> a connection using an SL != zero, will not be honoured, and a
>>> connection using SL zero will be created instead.
>>> 
>>> As a side note, in cm_process_routed_req(), we have:
>>> 
>>> 	IBA_SET(CM_REQ_PRIMARY_REMOTE_PORT_LID, req_msg, wc->dlid_path_bits);
>>> 
>>> which is strange, since a LID is 16 bits, whereas dlid_path_bits is only eight.
>>> 
>>> I am uncertain about the correct fix here. Any input appreciated.
>> 
>> I intend to send a patch doing:
>> 
>> +++ b/drivers/infiniband/core/cm.c
>> @@ -2138,7 +2138,8 @@ static int cm_req_handler(struct cm_work *work)
>>                goto destroy;
>>        }
>> 
>> -       cm_process_routed_req(req_msg, work->mad_recv_wc->wc);
>> +       if (cm_id_priv->av.ah_attr.type != RDMA_AH_ATTR_TYPE_ROCE)
>> +               cm_process_routed_req(req_msg, work->mad_recv_wc->wc);
>> 
>>        memset(&work->path[0], 0, sizeof(work->path[0]));
>>        if (cm_req_has_alt_path(req_msg))
>>> 
>> if I do not get a push back.
> 
> This does seem reasonable, but I don't understand the underlying
> issue, why is anything in roce land touching the SL? Are you trying to
> use the SL as a proxy for the TOS bits?

We want to control the DSCP in the encapsulating IP packet to select different TCs. As per the RoCE Annex:

<quote>
The SL component in the Address Vector is used to determine the Ethernet Priority of generated RoCEv2 packets. SL 0-7 are mapped directly to Priorities 0-7, respectively. SL 8-15 are reserved.
</quote>

Quite similar to how IB Link-layer translates the SL to an VL.



Thxs, Håkon







[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux