> On 8 Jul 2021, at 20:52, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Thu, Jul 08, 2021 at 03:59:25PM +0000, Haakon Bugge wrote:
>>
>>
>>> On 5 Jul 2021, at 18:59, Haakon Bugge <haakon.bugge@xxxxxxxxxx> wrote:
>>>
>>>
>>>
>>>> On 5 Jul 2021, at 18:26, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>>
>>>> On Tue, Jun 29, 2021 at 01:45:35PM +0000, Haakon Bugge wrote:
>>>>
>>>>>>>> IMHO it is a bug on the sender side to send GMPs to use a pkey that
>>>>>>>> doesn't exactly match the data path pkey.
>>>>>>>
>>>>>>> The active connector calls ib_addr_get_pkey(). This function
>>>>>>> extracts the pkey from byte 8/9 in the device's bcast
>>>>>>> address. However, RFC 4391 explicitly states:
>>>>>>
>>>>>> pkeys in CM come only from path records that the SM returns, the above
>>>>>> should only be used to feed into a path record query which could then
>>>>>> return back a limited pkey.
>>>>>>
>>>>>> Everything thereafter should use the SM's version of the pkey.
>>>>>
>>>>> Revisiting this. I think I mis-interpreted the scenario that led to
>>>>> the P_Key mismatch messages.
>>>>>
>>>>> The CM retrieves the pkey_index that matched the P_Key in the BTH
>>>>> (cm_get_bth_pkey()) and thereafter calls ib_get_cached_pkey() to get
>>>>> the P_Key value of the particular pkey_index.
>>>>>
>>>>> Assume a full-member sends a REQ. In that case, both P_Keys (BTH and
>>>>> primary path_rec) are full. Further, assume the recipient is only a
>>>>> limited member. Since full and limited members of the same partition
>>>>> are eligible to communicate, the P_Key retrieved by
>>>>> cm_get_bth_pkey() will be the limited one.
>>>>
>>>> It is incorrect for the issuer of the REQ to put a full pkey in the
>>>> REQ message when the target is a limited member.
>>>
>>> Sorry, I mis-interpreted the spec. I thought the PKey in the Path record should be that of the initiator, not the target's. OK. Will come up with a fix.
>>
>> On the systems I have access to (running Oracle flavour OpenSM in
>> our NM2 switches), the behaviour is exactly the opposite of what you
>> say.
>
> Check with saquery what is happening, if you request a reversible path
> from the CM target (limited pkey) to the CM client (full) you should
> get the limited pkey or the SM is broken.
>
> If the SM is working then probably something in the stack is using a
> reversed src/dest when doing the PR query.
>
> It is not intuitive but the PR query should have SGID as the CM Target
> even though it is running on the CM Client.

That is not how it is today. And because of that, all accesses to the PR assume the d{gid,lid} is the remote peer. To fix this, I have to swap dgid/sgid and ib.dlid/ib.slid all over to get this working. That is pervasive. E.g., even includes ipoib.

Let me know if that is what you want.


Thxs, Håkon

>
> This is because the REQ is supposed to contain a path that is relative
> to the target.
>
> Everything will be the same except for this small detail about
> full/limited pkeys.
>
> The client can figure out what to do with its own pkey table locally.
>
>> "the P_Key table entry (0x1234) matching incoming BTH.P_Key differs from primary path P_Key (0x9234)"
>
> "The REQ contains a PKey (0x1234) that is not found in this device's
> PKey table. Using alternative limited Pkey (0x9234) instead. This is a
> client bug"
>
> Jason
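
For reference, the full/limited matching rule discussed in this thread can be sketched in a few lines of C. This is only an illustration, not the kernel's code: every name below (pkey_is_full(), pkey_same_partition(), pkey_can_communicate(), find_matching_pkey_index()) is hypothetical and exists only for this sketch. It assumes the usual P_Key layout, where bit 15 is the membership bit (1 = full, 0 = limited) and the low 15 bits are the partition number, so 0x9234 and 0x1234 are the full and limited forms of the same partition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed P_Key layout: bit 15 = membership, bits 14..0 = partition number. */
#define PKEY_FULL_MEMBER 0x8000u
#define PKEY_BASE_MASK   0x7FFFu

/* Hypothetical helper: true if the P_Key has the full-member bit set. */
static bool pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* Hypothetical helper: true if both P_Keys name the same partition. */
static bool pkey_same_partition(uint16_t a, uint16_t b)
{
	return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

/*
 * Two endpoints may communicate if they are in the same partition and
 * at least one of them is a full member. Limited <-> limited is rejected.
 */
static bool pkey_can_communicate(uint16_t a, uint16_t b)
{
	return pkey_same_partition(a, b) &&
	       (pkey_is_full(a) || pkey_is_full(b));
}

/*
 * Hypothetical lookup: return the index of the first local table entry
 * that may communicate with the incoming P_Key, or -1 if there is none.
 * The entry found this way can differ from the incoming value in the
 * membership bit, which is the situation behind the mismatch message.
 */
static int find_matching_pkey_index(const uint16_t *table, size_t len,
				    uint16_t incoming)
{
	for (size_t i = 0; i < len; i++)
		if (pkey_can_communicate(table[i], incoming))
			return (int)i;
	return -1;
}
```

With a local table holding only the limited entry 0x1234, an incoming 0x9234 resolves to that entry, so the value read back from the table is not the value carried in the REQ.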