On Fri, Aug 28, 2020 at 11:51:07AM -0500, Bob Pearson wrote:
> I have been trying to reduce the number of test failures in the
> pyverbs tests for the rxe driver. There is one class of these errors
> that seems to be potentially a design failure in rdma core. By
> default each time a new RoCE device is registered the core sets up a
> gid table in cache.c and populates the first gid entry with the
> eui64 version of the IPv6 link local address. Later the other IP
> addresses configured on each port are added as well. It is expected
> that the default entry with sgid_index = 0 will function as a valid
> source address. Five years ago this probably always worked but more
> modern OSes have stopped using this address for privacy
> reasons. Ubuntu 20.04 which is the one I am working on uses a pseudo
> random address and not the MAC based one. Windows and iOS also
> apparently no longer use this address. The result is that the
> pyverbs test cases which use sgid_index = 0 in some cases, and use
> random sgid_indices including 0 in others, fail. The most common
> failure symptom is that when attempting to add a remote address to a
> QP (INIT -> RTR) it is unable to contact the invalid address and it
> times out.

The RoCEv1 GID is formed as you described above. Is rxe triggering
some RoCEv1 support that it can't handle?

> A better choice for the default GID for RoCEv2 devices may be to
> just use the IPv6 address configured as the link local address for
> the ndev. If they use the eui64 address the result will be the
> same. At least some of these OSes claim that the link local address
> is temporary, changing periodically. This would require tracking
> IPv6.

Certainly RoCEv2 devices shouldn't have GIDs that do not match their
IP addresses. Otherwise it would malform a UDP header.

Maybe Parav remembers if there is some tricky reason why this is still
being done?

Jason
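
For reference, the "eui64 version of the IPv6 link local address" in the
quoted text is the address derived from the port's MAC. The sketch below
shows the standard modified EUI-64 derivation in Python. It is only an
illustration, not the code in cache.c; the function name eui64_link_local
and the example MAC are made up for this sketch.

```python
import ipaddress


def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the MAC-based (modified EUI-64) IPv6 link-local address.

    Hypothetical helper for illustration; not part of rdma-core or pyverbs.
    """
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC such as 00:11:22:33:44:55")

    # Interface identifier: flip the universal/local bit of the first
    # octet and insert ff:fe between the two halves of the MAC.
    iid = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]

    # Prepend the fe80::/64 link-local prefix (fe80 followed by 48 zero bits).
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + iid)


if __name__ == "__main__":
    # Example MAC chosen arbitrarily for the sketch.
    print(eui64_link_local("00:11:22:33:44:55"))
```

If the address printed for the ndev's MAC is not among the addresses
actually configured on the interface, then a GID built this way names a
source address the host does not own, which would be consistent with the
timeouts described above.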