On Mon, Dec 09, 2019 at 02:07:06PM -0500, Doug Ledford wrote: > On Tue, 2019-12-03 at 19:46 -0500, Doug Ledford wrote: > > On Tue, 2019-12-03 at 08:25 -0800, Bart Van Assche wrote: > > > On 12/2/19 6:03 PM, Steve Wise wrote: > > > > If RoCE PDUs being sent or received contain pad bytes, then the > > > > iCRC > > > > is miscalculated resulting PDUs being emitted by RXE with an > > > > incorrect > > > > iCRC, as well as ingress PDUs being dropped due to erroneously > > > > detecting > > > > a bad iCRC in the PDU. The fix is to include the pad bytes, if > > > > any, > > > > in iCRC computations. > > > > > > Should this description mention that this patch breaks > > > compatibility > > > with SoftRoCE drivers that do not include this fix? Do we need a > > > kernel > > > module parameter that allows to select either the old or the new > > > behavior? > > > > No. The original soft-RoCE driver was supposed to be compatible with > > hardware devices. Because of this bug, it obviously wasn't. This is > > a > > bug fix, and we do not need to do anything to be compatible with the > > broken behavior. Instead, it just needs noting that the soft-RoCE > > implementation in prior kernels has a known wire format bug that > > impacts > > communications with both fixed versions of the driver and real > > hardware > > devices. > > I've taken these two patches into for-rc (with fixups to the commit > message on the second, as well as adding a Fixes: tag on the second). > > I stand by what I said about not needing a compatibility flag or module > option for the user to set. However, that isn't to say that we can't > autodetect old soft-RoCE peers. If we get a packet that fails CRC and > has pad bytes, then re-run the CRC without the pad bytes and see if it > matches. If it does, we could A) mark the current QP as being to an old > soft-RoCE device (causing us to send without including the pad bytes in > the CRC) and B) allocate a struct old_soft_roce_peer and save the guid > into that struct and then put that struct on a list that we then search > any time we are creating a new queue pair and if the new queue pair goes > to a guid in the list, then we immediately flag that qp as being to an > old soft roce device and get the right behavior. It would slow down qp > creation somewhat due to the list search, but probably not enough to > worry about. No one will be doing a 1,000 node cluster MPI job over > soft-RoCE, so we should never notice the list length causing search > problems. A patch to do something like that would be welcome. Do you find this implementation needed? I see RXE as a development platform and in my view it is unlikely that someone will run RXE in production with mixture of different kernel versions, which requires such compatibility fallback. Thanks > > -- > Doug Ledford <dledford@xxxxxxxxxx> > GPG KeyID: B826A3330E572FDD > Fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD