RE: [PATCH RFC 0/9] A rendezvous module

> > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Interesting. I was thinking the opposite. So what's the benefit? When would
> someone want to do that?
The more interesting scenario is customers who would like to run libfabric and other OpenFabrics Alliance software over various verbs-capable hardware.
Today PSM2 is a good choice for OPA hardware.  However, for some other devices without existing libfabric providers, rxm and rxd are the best choices.
As James Erwin presented at the OpenFabrics workshop today, PSM3 offers noticeable benefits over the existing libfabric rxm and rxd providers,
and the rv module offers noticeable performance benefits when using PSM3.

> This driver is intended to work with a fork of the PSM2 library. The
> PSM2 library which is for Omni-Path is now maintained by Cornelis
> Networks on our GitHub. PSM3 is something from Intel for Ethernet. I
> know it's a bit confusing.
Intel retains various IP and trademark rights.  Intel's marketing team analyzed and chose the name PSM3.  Obviously there are pluses and minuses to any name choice.

This is not unlike other industry software history, where new major revisions often add and remove support for various HW generations.
PSM(1) - supported InfiniPath IB adapters; was a standalone API (various forms).
PSM2 - dropped support for InfiniPath and IB and added support for Omni-Path, along with various features; also added libfabric support.
PSM3 - dropped support for Omni-Path and added support for RoCE and verbs-capable devices, along with other features;
	also dropped the PSM2 API and standardized on libfabric.
All three share a similar strategy: onload protocols for eager messages, and shared kernel/HW resources with direct data placement (RDMA) for large messages.
So the name Performance Scaled Messaging is meant to reflect that concept and approach
rather than a specific HW implementation or even API.

PSM3 is only available as a libfabric provider.
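For context, here is a minimal sketch of how an application (or MPI middleware) selects the psm3
provider through the standard libfabric API; the endpoint type and capability bits chosen here are
illustrative assumptions on my part, not requirements documented in this patch set.

#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

int main(void)
{
	struct fi_info *hints = fi_allocinfo();
	struct fi_info *info = NULL;
	int ret;

	if (!hints)
		return 1;

	/* Ask libfabric for the psm3 provider by name. */
	hints->fabric_attr->prov_name = strdup("psm3");
	/* Illustrative choices: a reliable datagram endpoint with the
	 * capabilities MPI middleware typically requests. */
	hints->ep_attr->type = FI_EP_RDM;
	hints->caps = FI_MSG | FI_TAGGED | FI_RMA;

	ret = fi_getinfo(FI_VERSION(1, 11), NULL, NULL, 0, hints, &info);
	if (ret) {
		fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
		fi_freeinfo(hints);
		return 1;
	}

	printf("selected provider: %s\n", info->fabric_attr->prov_name);

	fi_freeinfo(info);
	fi_freeinfo(hints);
	return 0;
}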

> I haven't had a chance to look beyond the cover letter in depth at how things
> have changed. I really hope it's not that bad.
While a few stylistic elements were carried forward, as you noticed, this is much different from hfi1; it doesn't directly access hardware and is hence smaller.
We carefully looked at the overlap with features in ib_core, and the patch set contains a couple of minor API additions to ib_core to simplify some operations,
which others may find useful.

> I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol.
The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv.
We'd welcome other name suggestions; we wanted to keep the name simple and brief.
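To illustrate what "bulk transfer part" means here, below is a purely illustrative sketch of the
usual size-based split between an eager send and a rendezvous transfer.  None of the names or the
threshold come from the rv module or PSM3; they are assumptions made for the example.  The rv
module's role corresponds to the RDMA (bulk) step of such a flow.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical threshold; real implementations tune this per fabric. */
#define EAGER_LIMIT 8192

/* Eager path: the payload is copied into the message itself (onload protocol). */
static void send_eager(const void *buf, size_t len)
{
	(void)buf;
	printf("eager send: copy %zu bytes into the send buffer\n", len);
}

/* Rendezvous path: only a request travels eagerly; the bulk data moves by RDMA. */
static void send_rendezvous(const void *buf, size_t len)
{
	(void)buf;
	printf("RTS: advertise a %zu-byte source buffer to the receiver\n", len);
	printf("CTS: receiver registers its buffer and returns its address/rkey\n");
	printf("RDMA: kernel/HW (an rv-like module) moves the bulk data\n");
	printf("FIN: completion is reported back to both ranks\n");
}

static void mpi_like_send(const void *buf, size_t len)
{
	if (len <= EAGER_LIMIT)
		send_eager(buf, len);
	else
		send_rendezvous(buf, len);
}

int main(void)
{
	static char small_msg[256], large_msg[1 << 20];

	mpi_like_send(small_msg, sizeof(small_msg));
	mpi_like_send(large_msg, sizeof(large_msg));
	return 0;
}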

> No pre-adding reserved stuff
> Lots of alignment holes, don't do that either.
We'd like advice on a challenging situation.  Some customers want NICs to support NVIDIA GPUs in some environments.
Unfortunately the NVIDIA GPU drivers are not upstream, and have not been for years, so we are forced to maintain both out-of-tree
and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
GPU-enabled versions of the code to have the same ABI as the upstream code, as this greatly simplifies things.
We have removed all GPU-specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
mechanisms to hold places for GPU-specific fields which can't be upstreamed.  A simplified example of what we mean is below.
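As an illustration only (this struct is hypothetical and not taken from the rv patches), the idea
is to make the padding and the out-of-tree GPU fields occupy explicitly reserved space, so the
upstream and GPU-enabled builds keep an identical layout:

#include <linux/types.h>

/* Hypothetical uAPI parameter struct, NOT from the rv patches. */
struct rv_example_params {
	__u64 addr;	/* user buffer address */
	__u64 length;	/* transfer length in bytes */
	__u32 flags;	/* request flags */
	__u32 pad;	/* explicit pad: the GPU build uses this slot for a GPU flag word */
	__u64 resv[2];	/* reserved: the GPU build uses these for GPU-specific fields */
};

/* Guard the layout so both builds keep the same ABI. */
_Static_assert(sizeof(struct rv_example_params) == 40,
	       "rv_example_params layout must not change");

The compile-time size check is just part of the illustration; it makes any accidental divergence
between the two builds show up at build time.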

Todd



