> -----Original Message-----
> From: Pearson, Robert B <robert.pearson2@xxxxxxx>
> Sent: Thursday, 10 February 2022 06:13
> To: Christian Blume <chr.blume@xxxxxxxxx>; RDMA mailing list <linux-rdma@xxxxxxxxxxxxxxx>
> Subject: [EXTERNAL] RE: Soft-RoCE performance
>
> Christian,
>
> There are two key differences between TCP and soft RoCE. Most importantly
> TCP can use a 64KiB MTU which is fragmented by TSO or GSO if your NIC
> doesn't support TSO while soft RoCE is limited by the protocol to a 4KiB
> payload. With overhead for headers you need a link MTU of about 4096+256.
> If your application is going between soft RoCE and hard RoCE you have to
> live with this limit and compute ICRC on each packet. Checking is optional
> since RoCE packets have a crc32 checksum from ethernet. If you are using
> soft RoCE to soft RoCE you can ignore both ICRC calculations and with a
> patch increase the MTU above 4KiB. I have measured write performance up to
> around 35 GB/s in local loopback on a single 12 core box (AMD 3900x) using
> 12 IO threads, 16KB MTU, and ICRC disabled for 1MB messages. This is on
> head of tree with some patches not yet upstream.
>
> Bob Pearson
> rpearsonhpe@xxxxxxxxx
> rpearson@xxxxxxx
>
>
> -----Original Message-----
> From: Christian Blume <chr.blume@xxxxxxxxx>
> Sent: Wednesday, February 9, 2022 9:34 PM
> To: RDMA mailing list <linux-rdma@xxxxxxxxxxxxxxx>
> Subject: Soft-RoCE performance
>
> Hello!
>
> I am seeing that Soft-RoCE has much lower throughput than TCP. Is that
> expected? If not, are there typical config parameters I can fiddle with?
>
> When running iperf I am getting around 300MB/s whereas it's only around
> 100MB/s using ib_write_bw from perftests.
>
> This is between two machines running Ubuntu20.04 with the 5.11 kernel.
>
> Cheers,
> Christian

It reminds me of a discussion we had a while ago - see
https://patchwork.kernel.org/project/linux-rdma/patch/20200414144822.2365-1-bmt@xxxxxxxxxxxxxx/

Running on TCP and implementing iWarp, siw suffers from the same problem.
Maybe it makes sense to look into a generic solution to the problem
for software-based RDMA implementations, potentially using the existing
RDMA core infrastructure?

Back then, we proposed using a spare protocol bit to do GSO signaling.
Krishna extended that idea to an MTU size negotiation using multiple
spare bits. Another idea was to use the rdma netlink protocol for
making those settings. That may also cover toggling CRC calculation.
iWarp allows for that negotiation, but there is no API for it.

Control could be provided per interface, or per QP ID, or both
(which I'd prefer).

With the rxe driver coming up with a similar thing, I tend to prefer
such a generic solution, even if it further complicates the common man's
RDMA usage.

What do others think?

Thanks,
Bernard.
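
To put rough numbers on the payload-size argument in this thread, here is a minimal sketch that counts how many packets one message needs at the payload sizes mentioned (4KiB, 16KB, 64KiB). It assumes "1MB" means 1 MiB and that the sender computes one ICRC per packet; the helper name `packets_per_message` is made up for illustration and is not part of rxe, siw, or any RDMA API.

```python
def packets_per_message(message_bytes: int, payload_bytes: int) -> int:
    """Packets needed to carry one message (hypothetical helper, ceiling division)."""
    return (message_bytes + payload_bytes - 1) // payload_bytes


# Assumption: "1MB messages" in the quoted measurement means 1 MiB.
MESSAGE_BYTES = 1024 * 1024

# Payload sizes named in the thread: RoCE protocol limit, patched rxe MTU, TCP with TSO/GSO.
for payload in (4096, 16384, 65536):
    count = packets_per_message(MESSAGE_BYTES, payload)
    # Assumption: one ICRC computation per packet on the sending side.
    print(f"payload {payload:>5} B: {count:>3} packets per message")
```

Under these assumptions, going from a 4KiB to a 16KB payload cuts the per-message packet count (and with it the per-packet header and checksum work) by a factor of four, which is the effect a negotiated larger MTU would aim for.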