Hey folks, (chiming in very late here...)
I think, if you want to build a uAPI for notification of MR lease
break, then you need to show how it fits into the above software model:
- How it can be hidden in a RDMA specific library
So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
the solution generic across DAX and non-DAX. What's your feeling for
how well applications are prepared to deal with that status return?
Stuffing an entry into the CQ is difficult. The CQ is in user memory
and, for several pieces of hardware, it is DMA'd from the HCA, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues, since the CQ is not meant for this.
But why should the kernel ever need to mangle the CQ? If a lease break
would deregister the MR, the device is expected to generate remote
protection errors on its own.
And in that case, I think we need a query mechanism rather than an event
mechanism, so when the application starts seeing protection errors
it can query the relevant MR (I think most if not all devices have that
information in their internal completion queue entries).
So, you need a side channel of some kind, either in certain drivers or
generically.
- How lease break can be done hitlessly, so the library user never
needs to know it is happening or see failed/missed transfers
I agree that the application should not be aware of lease breakages, but
seeing failed transfers is perfectly acceptable given that an access
violation is happening (my assumption is that failed transfers are error
completions reported in the user completion queue). What we need to have
is a framework to help user-space recover sanely, which is to query
which MR had the access violation, restore it, and re-establish the queue
pair.
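
To make the flow I have in mind a bit more concrete, here is a rough
sketch in plain C. Everything in it is an assumption for illustration:
the wc_status values are stand-ins for the relevant ibv_wc_status codes,
and recovery_ops (query_faulting_mr, restore_mr, reestablish_qp) are
hypothetical building blocks, none of which exist in libibverbs today.
The demo_* backend is a fake used only to exercise the control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the completion statuses of interest. Real code would
 * inspect enum ibv_wc_status from <infiniband/verbs.h> instead. */
enum wc_status { WC_OK, WC_PROT_ERR, WC_OTHER_ERR };

enum recovery_result { RECOVERY_NOT_NEEDED, RECOVERY_DONE, RECOVERY_FAILED };

/* Hypothetical building blocks; each returns 0 on success. */
struct recovery_ops {
    int (*query_faulting_mr)(void *ctx, uint32_t *key);
    int (*restore_mr)(void *ctx, uint32_t key);
    int (*reestablish_qp)(void *ctx);
};

/* Runs only after an error completion, so the fast path is untouched. */
static enum recovery_result
recover_from_completion(enum wc_status status,
                        const struct recovery_ops *ops, void *ctx)
{
    uint32_t key = 0;

    assert(ops != NULL);
    if (status == WC_OK)
        return RECOVERY_NOT_NEEDED;
    if (status != WC_PROT_ERR)
        return RECOVERY_FAILED;      /* not a protection error */
    if (ops->query_faulting_mr(ctx, &key) != 0)
        return RECOVERY_FAILED;      /* could not identify the MR */
    if (ops->restore_mr(ctx, key) != 0)
        return RECOVERY_FAILED;      /* could not restore the MR */
    if (ops->reestablish_qp(ctx) != 0)
        return RECOVERY_FAILED;      /* QP stays in error state */
    return RECOVERY_DONE;
}

/* Fake backend: counts the steps run and fails at step fail_step
 * (1-based); fail_step == 0 means no step fails. */
struct demo_state { int steps_run; int fail_step; uint32_t restored_key; };

static int demo_step(struct demo_state *s)
{
    s->steps_run++;
    return s->steps_run == s->fail_step ? -1 : 0;
}

static int demo_query(void *ctx, uint32_t *key)
{
    *key = 0x1234;                   /* made-up key of the faulting MR */
    return demo_step(ctx);
}

static int demo_restore(void *ctx, uint32_t key)
{
    struct demo_state *s = ctx;

    s->restored_key = key;
    return demo_step(s);
}

static int demo_reestablish(void *ctx)
{
    return demo_step(ctx);
}

static const struct recovery_ops demo_ops = {
    demo_query, demo_restore, demo_reestablish
};
```

The point of the shape is that all three steps are slow-path operations
triggered by an error completion, and nothing is added to the success
path.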
iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.
Yes, if the iommu can be fenced properly it sounds doable.
- Whatever fast path checking is needed does not kill performance
What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.
ibv_poll_cq() and ibv_post_send() would be a fast path.
Where this struggled before is that, in creating a side channel, you also
now have to check that side channel, and checking it at high performance
is quite hard. Even quiescing things to be able to tear down the MR
has performance implications on post send...
This is exactly why I think we should not have it, but instead give
building blocks to recover sanely from error completions...
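
For reference, a minimal sketch of what checking a side channel in the
post-send path could look like, assuming a hypothetical shared epoch
counter that gets bumped on every lease break. lease_channel,
send_queue, and post_send_checked are made-up names, not part of any
existing API; the sketch only shows where the extra cost and the
remaining race sit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical side channel: a counter bumped whenever a lease breaks. */
struct lease_channel {
    _Atomic uint64_t epoch;
};

/* Hypothetical per-queue state on the library side. */
struct send_queue {
    struct lease_channel *chan;
    uint64_t seen_epoch;   /* epoch the queue's MRs were validated against */
    uint64_t posted;       /* work requests accepted so far */
};

/* Notifier side: record that some lease was broken. */
static void lease_break(struct lease_channel *c)
{
    atomic_fetch_add_explicit(&c->epoch, 1, memory_order_release);
}

/* Fast path: every post now pays for an acquire load on a shared line.
 * A break can still land between this load and the doorbell, which is
 * why tearing down the MR additionally needs some form of quiescing. */
static bool post_send_checked(struct send_queue *sq)
{
    uint64_t now;

    assert(sq->chan != NULL);
    now = atomic_load_explicit(&sq->chan->epoch, memory_order_acquire);
    if (now != sq->seen_epoch)
        return false;      /* caller must revalidate its MRs first */
    sq->posted++;          /* stand-in for building and posting the WQE */
    return true;
}

/* Slow path: after revalidating MRs, adopt the current epoch. */
static void revalidate(struct send_queue *sq)
{
    sq->seen_epoch =
        atomic_load_explicit(&sq->chan->epoch, memory_order_acquire);
}
```

Even in this toy form the check sits on every post, and it does not
close the window between the check and the hardware seeing the request.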