On Mon, Aug 23, 2021 at 5:19 PM Oded Gabbay <ogabbay@xxxxxxxxxx> wrote:
>
> On Mon, Aug 23, 2021 at 4:04 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >
> > On Mon, Aug 23, 2021 at 11:53:48AM +0300, Oded Gabbay wrote:
> >
> > > Do you see any issue with that ?
> >
> > It should work out, without a netdev you have to be more careful about
> > addressing and can't really use the IP addressing modes. But you'd
> > have a singular hardwired roce gid in this case and act more like an
> > IB device than a roce device.
> >
> > Where you might start to run into trouble is you probably want to put
> > all these ports under a single struct ib_device and we've been moving
> > away from having significant per-port differences. But I suspect it
> > can still work out.
> >
> > Jason
>
> ok, thanks for all the info.
> I will go look at the efa driver.
>
> Thanks,
> Oded

Hi Jason.

So it took a *bit* longer than expected due to higher-priority tasks,
but in the last month we did a thorough investigation of how our h/w
maps to the IBverbs API, and it appears we have a few constraints that
are not quite common. Tackling these constraints can affect the basic
design of the driver, or even be a non-starter for this entire
endeavor. Therefore, I would like to list the major constraints and
get your opinion on whether they are significant, and if so, how to
tackle them.

For context, the Gaudi NICs were designed primarily as a scale-out
fabric for doing Deep-Learning training across thousands of Gaudi
devices. This means the designated deployment is one where the entire
network is composed of Gaudi NICs and L2/L3 switches. Interoperability
with other NICs was not the main goal, although we did manage to
interoperate with a MLNX RDMA NIC in the lab.

In addition, I would like to remind you that each Gaudi has multiple
NIC ports, but from our perspective they are all used for the same
purpose, i.e. we use ALL the Gaudi NIC ports for a single user process
to distribute its Deep-Learning training workload. Because of that, we
want to put all the ports under a single struct ib_device, as you
yourself suggested in your original email a year ago. I haven't listed
this as a h/w constraint, but it is very important for us from a
system/deployment perspective. I would go so far as to say it is
pretty much mandatory.

The major constraints are:

1. We support only the RDMA WRITE operation. We do not support READ,
SEND or RECV. This means that many existing open source tests in
rdma-core are not compatible, e.g. rc_pingpong.c will not work, since
it relies on SEND/RECV (see the sketch after this list). I guess we
will need to implement different tests and submit them? Do you have a
different idea/suggestion?

2. As you mentioned in the original email, we support only a single
PD. I don't see any major implication of this constraint, but please
correct me if you think otherwise.

3. An MR limitation on the rkey that is received from the remote side
during connection establishment. Our h/w extracts the rkey from the QP
h/w context and not from the WQE when sending packets (see the sketch
after this list), which means we can associate only a single remote MR
per QP. Moreover, we have the mirror-image limitation on the rkey that
we give to the remote side: the h/w takes it from the QP h/w context
and not from the received packets, so we hand out the same rkey for
all MRs that we create per QP. Do you see any issue with these two
limitations? One thing we noted is that we would need some way to
configure the rkey in our h/w QP context, while today the API doesn't
allow it. These limitations are not relevant in a deployment where all
the NICs are Gaudi NICs, because there we can use a single rkey for
all MRs.

4. We do not support all the flags in the reg_mr API, e.g. we don't
support IBV_ACCESS_LOCAL_WRITE. I'm not sure what the implication is
here (see the P.S. below for how existing consumers use these flags).

5. Our h/w contains several accelerations we would like to utilize,
e.g. a h/w mechanism for accelerating collective operations across
multiple RDMA NICs. These accelerations will require either extensions
to current APIs or some dedicated APIs. For example, one of the
accelerations requires that the user create a QP with the same index
on all the Gaudi NICs.
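To make constraints #1 and #3 concrete, here is a minimal sketch (not
our driver code; qp/mr/buf and friends are placeholders assumed to be
set up earlier) of the one data-path operation we can support, written
against the verbs API as it exists today. Note that the standard API
carries the rkey in every work request via wr.wr.rdma.rkey, whereas
our h/w would take it from the QP context:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a single RDMA WRITE - the only opcode Gaudi supports. The QP
 * is assumed to be connected, the local buffer registered, and the
 * remote buffer address/rkey already exchanged out of band.
 */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
			   void *buf, uint32_t len,
			   uint64_t remote_addr, uint32_t remote_rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.wr_id      = 1,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_RDMA_WRITE,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad_wr;

	/* In standard verbs the rkey travels in each WR; on our h/w it
	 * would effectively be per-QP, taken from the QP context.
	 */
	wr.wr.rdma.remote_addr = remote_addr;
	wr.wr.rdma.rkey        = remote_rkey;

	return ibv_post_send(qp, &wr, &bad_wr);
}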
Those are the major constraints. We have a few others, but imo they
are less severe and can be discussed when we upstream the code.

btw, due to the large effort, we will do this conversion only for
Gaudi2 (and beyond). Gaudi1 will continue to use our proprietary,
not-upstreamed, kernel driver uAPI.

Appreciate your help on this.

Thanks,
Oded
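P.S. Regarding constraint #4, a small illustration of the flag problem
(placeholder pd/buf/len, not our code): rc_pingpong registers its
buffer with IBV_ACCESS_LOCAL_WRITE so the HCA can land incoming RECV
data in it, and the ibv_reg_mr man page requires IBV_ACCESS_LOCAL_WRITE
to be set whenever IBV_ACCESS_REMOTE_WRITE is set, so both of the
registration shapes below collide with our h/w limitation:

#include <infiniband/verbs.h>

/* What rc_pingpong does today: a receive buffer the local HCA writes
 * into on RECV completion. We can't honor LOCAL_WRITE.
 */
static struct ibv_mr *reg_rx_buffer(struct ibv_pd *pd, void *buf,
				    size_t len)
{
	return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
}

/* What a WRITE-only consumer would want: a buffer the remote side can
 * RDMA WRITE into. REMOTE_WRITE requires LOCAL_WRITE as well, so even
 * this shape needs the flag we lack.
 */
static struct ibv_mr *reg_write_target(struct ibv_pd *pd, void *buf,
				       size_t len)
{
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_WRITE);
}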