On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote:
>
> > The problem with MR is that the API doesn't let us return a new VA. It
> > forces us to use the original VA that the Host OS allocated.
>
> If using the common MR API you'd have to assign a unique linear range
> in the single device address map and record both the IOVA and the MMU
> VA in the kernel struct.
>
> Then when submitting work using that MR lkey the kernel will adjust
> the work VA using the equation (WORK_VA - IOVA) + MMU_VA before
> forwarding to HW.
>
We can't do that. That will kill the performance. If for every
submission I need to modify the packet's contents, the throughput will
go downhill.

Also, submissions to our RDMA QMANs are coupled with submissions to our
DMA/Compute QMANs. We can't separate those to different API calls. That
will also kill performance and, in addition, will prevent us from
synchronizing all the engines.

I also have to say, it troubles me that you keep referring to our
device as an RDMA device. It is not an RDMA device. It is a
deep-learning accelerator which uses RDMA as a way to interconnect
multiple devices. We don't intend to replace General-Purpose RDMA
devices. We know we don't support that. Therefore, I still fail to see
why we need to support all the above...

Our work submission is not to just "send/receive packets". Sending
packets is part of a general recipe to do DMA, perform compute on data
and send/receive data. All together, in a synchronized fashion.

The way you try to force me to go is to separate that into different
functionality, as if I have different ASICs, which is very
counter-productive in terms of performance and simplicity. I.e., have
one method of submitting work to DMA/compute and another way to RDMA
ports.

I know this is how the kernel is structured now - subsystems for
devices that belong to a single domain (graphics, net, storage).
But I fear that you will soon see this paradigm doesn't work with new
devices in AI, which combine multiple domains into a single ASIC.

Greg, I would love to hear your opinion here. Am I totally wrong?
Is treating a single ASIC that belongs to multiple domains as if it
were multiple ASICs a good thing? Don't you think it will hurt the
performance?

Oded

> EFA doesn't support rkeys, so they are not required to be emulated. It
> would have to create rkeys using some guadidv_reg_mr_rkey()
>
> It is important to understand that the usual way we support these
> non-RDMA devices is to insist that they use SW to construct a minimal
> standards based RDMA API, and then allow the device to have a 'dv' API
> to access a faster, highly device specific, SW bypass path.
>
> So for instance you might have some guadidv_post_work(qp) that doesn't
> use lkeys and works directly on the MMU_VA. A guadidv_get_mmu_va(mr)
> would return the required HW VA from the kernel.
>
> Usually the higher level communication library (UCX, MPI, etc) forms
> the dv primitives into something application usable.
>
> > we do if that VA is in the range of our HBM addresses? The device
> > won't be able to distinguish between them. The transaction that is
> > generated by an engine inside our device will go to the HBM instead of
> > going to the PCI controller and then to the host.
> >
> > That's the crux of the problem and why we didn't use MR.
>
> No, the problem with the device is that it doesn't have a lkey/rkey,
> so it is stuck with a single translation domain. RoCE compliant
> devices are required to have multiple translation domains - each
> lkey/rkey specifies a unique translation.
>
> The MR concept is a region of process VA mapped into the device for
> device access, and this device *clearly* has that.
>
> Jason
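
For readers following the address arithmetic, here is a minimal sketch of
the per-submission adjustment quoted above, (WORK_VA - IOVA) + MMU_VA,
with one translation selected per lkey. It is only an illustration in
Python, not kernel or driver code: the MemoryRegion record, its field
names, the functions adjust_work_va() and translate(), and the example
addresses are all hypothetical and do not come from any real API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MemoryRegion:
    """Hypothetical per-MR record holding both addresses (names are assumptions)."""
    lkey: int    # key a work request uses to select this translation
    iova: int    # start of the linear range assigned in the device address map
    mmu_va: int  # address the device MMU expects for the same memory
    length: int  # size of the region in bytes


def adjust_work_va(mr: MemoryRegion, work_va: int, size: int = 1) -> int:
    """Apply (WORK_VA - IOVA) + MMU_VA, rejecting accesses outside the MR."""
    offset = work_va - mr.iova
    if offset < 0 or offset + size > mr.length:
        raise ValueError("work VA range falls outside the memory region")
    return offset + mr.mmu_va


def translate(mrs: Dict[int, MemoryRegion], lkey: int, work_va: int, size: int = 1) -> int:
    """Look up the MR by lkey, then adjust the work VA for that MR."""
    if lkey not in mrs:
        raise KeyError("unknown lkey")
    return adjust_work_va(mrs[lkey], work_va, size)


# Two regions, each with its own translation (made-up addresses).
regions = {
    0x10: MemoryRegion(lkey=0x10, iova=0x10000000, mmu_va=0x7F0000000000, length=0x2000),
    0x11: MemoryRegion(lkey=0x11, iova=0x20000000, mmu_va=0x7F0000400000, length=0x1000),
}

print(hex(translate(regions, 0x10, 0x10000040)))  # prints 0x7f0000000040
```

The sketch makes the cost under discussion visible: every submitted
address needs a lookup and a rewrite before it reaches the hardware.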