On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote:
> On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@xxxxxxxxxx> wrote:
> > > >
> > > > Hey all,
> > > >
> > > > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > > > attachments, which requires the RDMA device to support on-demand-paging
> > > > (ODP) which is not common on most devices (only supported by mlx5).
> > > >
> > > > While the dynamic requirement makes sense for certain GPUs, some devices
> > > > (such as habanalabs) have device memory that is always "pinned" and do
> > > > not need/use the move_notify operation.
> > > >
> > > > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > > > and EFA as the importer to allow for peer2peer access through libibverbs.
> > > >
> > > > This draft patch changes the dmabuf driver to differentiate between
> > > > static/dynamic attachments by looking at the move_notify op instead of
> > > > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > > > in case of a static exporter.
> > > >
> > > > Thanks
> > > >
> > > > Signed-off-by: Gal Pressman <galpress@xxxxxxxxxx>
> > >
> > > Given that habanalabs dma-buf support is very firmly in limbo (at
> > > least it's not yet in linux-next or anywhere else) I think you want to
> > > solve that problem first before we tackle the additional issue of
> > > making p2p work without dynamic dma-buf. Without that it just doesn't
> > > make a lot of sense really to talk about solutions here.
> >
> > I have been thinking about adding a dmabuf exporter to VFIO, for
> > basically the same reason habana labs wants to do it.
> >
> > In that situation we'd want to see an approach similar to this as well
> > to have a broad usability.
> >
> > The GPU drivers also want this for certain sophisticated scenarios
> > with RDMA, the intree drivers just haven't quite got there yet.
> >
> > So, I think it is worthwhile to start thinking about this regardless
> > of habana labs.
>
> Oh sure, I've been having these for a while. I think there's two options:
> - some kind of soft-pin, where the contract is that we only revoke
> when absolutely necessary, and it's expected to be catastrophic on the
> importer's side.

Honestly, I'm not very keen on this. We don't really have HW support
in several RDMA scenarios for even catastrophic unpin.

Gal, can EFA even do this for a MR? You basically have to resize the
rkey/lkey to zero length (or invalidate it like a FMR) under the
catastrophic revoke. The rkey/lkey cannot just be destroyed as that
opens a security problem with rkey/lkey re-use.

I think I saw EFA's current out of tree implementations had this bug.

> to do is mmap revoke), and I think that model of exclusive device
> ownership with the option to revoke fits pretty well for at least some
> of the accelerators floating around. In that case importers would
> never get a move_notify (maybe we should call this revoke_notify to
> make it clear it's a bit different) callback, except when the entire
> thing has been yanked. I think that would fit pretty well for VFIO,
> and I think we should be able to make it work for rdma too as some
> kind of auto-deregister. The locking might be fun with both of these
> since I expect some inversions compared to the register path, we'll
> have to figure these out.

It fits semantically nicely, VFIO also has a revoke semantic for BAR
mappings.

The challenge is the RDMA side which doesn't have a 'dma disabled
error state' for objects as part of the spec.

Some HW, like mlx5, can implement this for MR objects (see revoke_mr),
but I don't know if anything else can, and even mlx5 currently can't
do a revoke for any other object type.

I don't know how useful it would be, need to check on some of the use
cases.

The locking is tricky as we have to issue a device command, but that
device command cannot run concurrently with destruction or the tail
part of creation.

Jason
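
To make the rkey/lkey re-use concern above concrete, here is a toy Python model. It is only a sketch: it is not the verbs API or any driver code, and KeyTable, revoke, deregister and remote_access are made-up names. It assumes a device that hands freed key values out again, and contrasts shrinking a revoked MR to zero length with destroying it outright.

```python
# Toy model only: not the verbs API, not kernel code. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    base: int      # start of the registered range
    length: int    # 0 after revoke: no byte is reachable any more
    owner: str     # who registered the region


class KeyTable:
    """Maps rkeys to regions, standing in for a device's key table."""

    def __init__(self):
        self._regions = {}
        self._free_keys = []   # key values released by deregistration
        self._next_key = 1

    def _alloc_key(self):
        key = self._next_key
        self._next_key += 1
        return key

    def register(self, base, length, owner):
        # Assumption: freed key values are recycled before new ones are minted.
        key = self._free_keys.pop() if self._free_keys else self._alloc_key()
        self._regions[key] = MemoryRegion(base, length, owner)
        return key

    def revoke(self, key):
        """Exporter yanked the memory: keep the key allocated, shrink to zero."""
        self._regions[key].length = 0

    def deregister(self, key):
        """Owner-driven teardown: only now may the key value be handed out again."""
        del self._regions[key]
        self._free_keys.append(key)

    def remote_access(self, key, addr, size):
        """Return the owner whose memory the access hits, or None if rejected."""
        mr = self._regions.get(key)
        if mr is None or size <= 0:
            return None
        if addr < mr.base or addr + size > mr.base + mr.length:
            return None
        return mr.owner


def demo():
    # Revoke in place: the stale key stays allocated and every access fails.
    safe = KeyTable()
    stale = safe.register(0x1000, 0x100, "app-a")
    safe.revoke(stale)
    fresh = safe.register(0x1000, 0x100, "app-b")
    safe_result = safe.remote_access(stale, 0x1000, 8)

    # Destroy on revoke: the key value is recycled, so a peer still holding
    # the old rkey now reaches the new owner's memory.
    buggy = KeyTable()
    stale2 = buggy.register(0x1000, 0x100, "app-a")
    buggy.deregister(stale2)
    reused = buggy.register(0x1000, 0x100, "app-b")
    buggy_result = buggy.remote_access(stale2, 0x1000, 8)

    return fresh != stale, safe_result, reused == stale2, buggy_result


if __name__ == "__main__":
    print(demo())
```

In this model the zero-length MR keeps its key value reserved until the owner deregisters it, which is the property the destroy-on-revoke path loses.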