On Fri, Jul 09, 2021 at 04:45:21PM +0000, Haakon Bugge wrote:
>
> > On 8 Jul 2021, at 20:52, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > On Thu, Jul 08, 2021 at 03:59:25PM +0000, Haakon Bugge wrote:
> >>
> >>> On 5 Jul 2021, at 18:59, Haakon Bugge <haakon.bugge@xxxxxxxxxx> wrote:
> >>>
> >>>> On 5 Jul 2021, at 18:26, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, Jun 29, 2021 at 01:45:35PM +0000, Haakon Bugge wrote:
> >>>>
> >>>>>>>> IMHO it is a bug on the sender side to send GMPs to use a pkey that
> >>>>>>>> doesn't exactly match the data path pkey.
> >>>>>>>
> >>>>>>> The active connector calls ib_addr_get_pkey(). This function
> >>>>>>> extracts the pkey from byte 8/9 in the device's bcast
> >>>>>>> address. However, RFC 4391 explicitly states:
> >>>>>>
> >>>>>> pkeys in CM come only from path records that the SM returns, the above
> >>>>>> should only be used to feed into a path record query which could then
> >>>>>> return back a limited pkey.
> >>>>>>
> >>>>>> Everything thereafter should use the SM's version of the pkey.
> >>>>>
> >>>>> Revisiting this. I think I mis-interpreted the scenario that led to
> >>>>> the P_Key mismatch messages.
> >>>>>
> >>>>> The CM retrieves the pkey_index that matched the P_Key in the BTH
> >>>>> (cm_get_bth_pkey()) and thereafter calls ib_get_cached_pkey() to get
> >>>>> the P_Key value of the particular pkey_index.
> >>>>>
> >>>>> Assume a full-member sends a REQ. In that case, both P_Keys (BTH and
> >>>>> primary path_rec) are full. Further, assume the recipient is only a
> >>>>> limited member. Since full and limited members of the same partition
> >>>>> are eligible to communicate, the P_Key retrieved by
> >>>>> cm_get_bth_pkey() will be the limited one.
> >>>>
> >>>> It is incorrect for the issuer of the REQ to put a full pkey in the
> >>>> REQ message when the target is a limited member.
> >>>
> >>> Sorry, I mis-interpreted the spec. I thought the PKey in the Path
> >>> record should be that of the initiator, not the target's. OK. Will
> >>> come up with a fix.
> >>
> >> On the systems I have access to (running Oracle flavour OpenSM in
> >> our NM2 switches), the behaviour is exactly the opposite of what you
> >> say.
> >
> > Check with saquery what is happening, if you request a reversible path
> > from the CM target (limited pkey) to the CM client (full) you should
> > get the limited pkey or the SM is broken.
> >
> > If the SM is working then probably something in the stack is using a
> > reversed src/dest when doing the PR query.
> >
> > It is not intuitive but the PR query should have SGID as the CM Target
> > even though it is running on the CM Client.
>
> That is not how it is today. And because of that, all accesses to
> the PR assume the d{gid,lid} is the remote peer. To fix this, I have
> to swap dgid/sgid and ib.dlid/ib.slid all over to get this
> working. That is pervasive. E.g., even includes ipoib. Let me know
> if that is what you want.

It is only things that use the paths to generate CM REQ messages, and
yes it is the right thing to do.

Jason
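
For readers following the thread, here is a minimal sketch of the two points under discussion: the P_Key membership bit, and why the direction of the path record query changes which P_Key comes back. This is an illustration only, not kernel or SM code. The names pkey_is_full(), pkeys_can_communicate() and model_path_pkey() are hypothetical, and model_path_pkey() is a simplified assumption based on the behaviour described above (the returned P_Key follows the port given as SGID), not a statement about how any particular SM is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u /* top bit set: full member; clear: limited */
#define PKEY_BASE_MASK   0x7fffu /* low 15 bits identify the partition */

/* True if the P_Key carries full membership. */
static bool pkey_is_full(uint16_t pkey)
{
    return (pkey & PKEY_FULL_MEMBER) != 0;
}

/*
 * Two P_Keys match when they name the same partition and at least one
 * of them is a full member. Two limited members cannot communicate.
 */
static bool pkeys_can_communicate(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return false;
    return pkey_is_full(a) || pkey_is_full(b);
}

/*
 * ASSUMPTION (simplified model, hypothetical helper): the path record
 * P_Key follows the port named as SGID in the query. Returns 0 when the
 * two ports cannot communicate at all.
 */
static uint16_t model_path_pkey(uint16_t sgid_port_pkey,
                                uint16_t dgid_port_pkey)
{
    if (!pkeys_can_communicate(sgid_port_pkey, dgid_port_pkey))
        return 0;
    return sgid_port_pkey;
}
```

Under this model, with a full-member CM client (0xffff) and a limited-member CM target (0x7fff), a query with SGID set to the client yields the full P_Key, while a query with SGID set to the target yields the limited one, which is the value the REQ is expected to carry.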