Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver

Oded Gabbay <oded.gabbay@xxxxxxxxx> · Fri, 18 Sep 2020 16:49:25 +0300

On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote:
>
> > The problem with MR is that the API doesn't let us return a new VA. It
> > forces us to use the original VA that the Host OS allocated.
>
> If using the common MR API you'd have to assign a unique linear range
> in the single device address map and record both the IOVA and the MMU
> VA in the kernel struct.
>
> Then when submitting work using that MR lkey the kernel will adjust
> the work VA using the equation (WORK_VA - IOVA) + MMU_VA before
> forwarding to HW.
>
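
For illustration only, the adjustment quoted above, (WORK_VA - IOVA) +
MMU_VA, could be sketched as below. This is a minimal sketch, not code
from the habanalabs driver or from the RDMA core: the struct, its field
names and the helper name are all hypothetical, and the bounds check is
an added assumption about how such a path would validate a request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-MR record; the names are illustrative only. */
struct example_mr {
	uint64_t iova;    /* start of the linear range handed to userspace */
	uint64_t mmu_va;  /* start of the same region in the device MMU */
	uint64_t length;  /* size of the region in bytes */
};

/*
 * Translate a work-request address into a device MMU address using
 * (WORK_VA - IOVA) + MMU_VA. Returns false if [work_va, work_va + len)
 * does not lie fully inside the MR; *hw_va is left untouched then.
 */
static bool example_mr_translate(const struct example_mr *mr,
				 uint64_t work_va, uint64_t len,
				 uint64_t *hw_va)
{
	uint64_t offset;

	if (work_va < mr->iova)
		return false;

	offset = work_va - mr->iova;
	if (offset > mr->length || len > mr->length - offset)
		return false;

	*hw_va = offset + mr->mmu_va;
	return true;
}
```

The per-submission cost in this sketch is one subtraction, one addition
and the range check, applied to every address carried in the work
request.
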
We can't do that. That will kill the performance. If for every
submission I need to modify the packet's contents, the throughput will
go downhill.
Also, submissions to our RDMA QMANs are coupled with submissions to
our DMA/Compute QMANs. We can't separate those into different API calls.
That will also kill performance and in addition, will prevent us from
synchronizing all the engines.

I also have to say, it troubles me that you keep referring to our
device as an RDMA device. It is not an RDMA device. It is a
deep-learning accelerator which uses RDMA as a way to interconnect
multiple devices. We don't intend to replace General-Purpose RDMA
devices. We know we don't support that.
Therefore, I still fail to see why we need to support all the above...

Our work submission is not to just "send/receive packets". Sending
packets is part of a general recipe to do DMA, perform compute on data
and send/receive data. All together, in a synchronized fashion.

The direction you are pushing me in is to split that into separate
functionality, as if I had different ASICs, i.e. one method of
submitting work to DMA/compute and another for the RDMA ports. That is
very counter-productive in terms of performance and simplicity.

I know this is how the kernel is structured now - subsystems for
devices that belong to a single domain (graphics, net, storage). But I
fear that you will soon see this paradigm doesn't work with new
devices in AI, which combine multiple domains into a single ASIC.

Greg, I would love to hear your opinion here. Am I totally wrong? Is
treating a single ASIC that belongs to multiple domains as if it were
multiple ASICs a good thing? Don't you think it will hurt
performance?

Oded

> EFA doesn't support rkeys, so they are not required to be emulated. It
> would have to create rkeys using some guadidv_reg_mr_rkey()
>
> It is important to understand that the usual way we support these
> non-RDMA devices is to insist that they use SW to construct a minimal
> standards based RDMA API, and then allow the device to have a 'dv' API
> to access a faster, highly device specific, SW bypass path.
>
> So for instance you might have some guadidv_post_work(qp) that doesn't
> use lkeys and works directly on the MMU_VA. A guadidv_get_mmu_va(mr)
> would return the required HW VA from the kernel.
>
> Usually the higher level communication library (UCX, MPI, etc) forms
> the dv primitives into something application usable.
>
> > we do if that VA is in the range of our HBM addresses ? The device
> > won't be able to distinguish between them. The transaction that is
> > generated by an engine inside our device will go to the HBM instead of
> > going to the PCI controller and then to the host.
> >
> > That's the crux of the problem and why we didn't use MR.
>
> No, the problem with the device is that it doesn't have an lkey/rkey,
> so it is stuck with a single translation domain. RoCE compliant
> devices are required to have multiple translation domains - each
> lkey/rkey specifies a unique translation.
>
> The MR concept is a region of process VA mapped into the device for
> device access, and this device *clearly* has that.
>
> Jason