kernel memory registration (was: RDMA/core: Transport-independent access flags)

On 7/9/2015 8:01 PM, Jason Gunthorpe wrote:
On Thu, Jul 09, 2015 at 02:02:03PM +0300, Sagi Grimberg wrote:

We have protocols that involve remote memory key transfer in their
standards, so I don't see how we can remove it altogether from ULPs.

This is why I've been talking about local and remote MRs
differently.

IMHO, memory registration is memory registration. The fact that we are
distinguishing between local and remote might be a sign that this is
the wrong direction to take. Sorry.

Besides, if a ULP wants to register memory for local access, why should
we tamper with that or deny it?

What if a ULP has a pre-allocated pool of large buffers that it knows
it is going to use for its entire lifetime? Silent, driver-driven FRWRs
would perform a lot worse than pre-registering these buffers.

Or what if the ULP wants to register the memory region with data
integrity (signature) parameters?

I must say, whenever I find myself trying to assume/guess what the
ULP/APP might do from the driver PoV, trying to see whether I'm covered,
I shake my head and say:
"This is a hack, go drink some water and rethink the whole thing".

If there is one thing worse than a complicated API, it is a restrictive
one. I'd much rather ULPs just have a simple API for registering
memory.

A Local MR is one where the Key is never put on the wire; it exists
solely to facilitate DMA between the CPU and the local HCA, and it
would never be needed if we had infinitely long S/G lists.

My main problem with this approach is that once you do non-trivial
things such as memory registration completely under the hood, it is
a slippery slope for device drivers.

Yes, there is going to be some stuff going on, but the simplification
for the ULP side is incredible; it is certainly something that should
be explored and not dismissed without some really good reasons.

If say a driver decides to register memory without the caller knowing,
it would need to post an extra work request on the send queue.

Yes, the first issue is how to do flow control on the sendq.

But this seems easily solved: every ULP is already tracking the
number of available entries in the sendq, and it will not start new ops
until there is space, so instead of doing the computation internally
on how much space is needed to do X, we factor it out:

    if (rdma_sqes_post_read(...) < avail_sqe)
            avail_sqe -= rdma_post_read(...);
    else
            // Try again after completions advance

Every new-style post op is paired with a 'how many entries do I need'
call.

This is not a new concept: a ULP working with FRMR already has to know
it cannot start an FRMR-using op unless there are 2 SQEs available (and
it has to make all this conditional on whether it is using FRMR or
something else). All this is doing is shifting the computation
of '2' out of the ULP and into the driver.
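
As a rough illustration of what shifting that computation into the
driver could look like (a minimal sketch only; none of these helpers
exist today, and 'max_sge' stands for the device's queried maximum
S/G entries per WR):

    /* Hypothetical per-device callbacks reporting how many SQEs one
     * RDMA READ of 'nents' pages would consume. */

    /* Device that can describe the buffer with its S/G list: one RDMA
     * READ WR, or N READs if the S/G list is too long. */
    static unsigned int sge_sqes_post_read(unsigned int nents,
                                           unsigned int max_sge)
    {
            return DIV_ROUND_UP(nents, max_sge);
    }

    /* FRMR-based device (e.g. iWARP): REG + RDMA READ + LOCAL INV. */
    static unsigned int frmr_sqes_post_read(unsigned int nents,
                                            unsigned int max_sge)
    {
            return 3;
    }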

So once it sees the completion, it needs to silently consume it and
have some non-trivial logic to invalidate it (another work request!)
either from poll_cq context or another thread.

Completions are driven by the ULP. Every new-style post also has a
completion entry point. The ULP calls it when it knows the op is
done, either because the WRID it provided has signaled completion, or
because a later op has completed (signaling was suppressed).

Since that may also be an implicitly posting API (if necessary, is
it?), it follows the same rules as above. This isn't changing
anything. ULPs would already have to drive invalidate posts from
completion with flow control; we are just moving the actual post
construction and computation of the needed SQEs out of the ULP.

This would also require the drivers to take a heuristic approach to how
many memory registration resources are needed for all possible
consumers (ipoib, sdp, srp, iser, nfs, more...) which might have
different requirements.

That doesn't seem like a big issue. The ULP can give a hint on the PD
or QP about what sort of usage it expects ('up to 16 RDMA READs', 'up
to 1MB transfer per RDMA') and the core can use a pre-allocated pool
scheme.

I was thinking about a pre-allocation for local here, as Christoph
suggests. I think that is a refinement we could certainly add on, once
there is some clear idea what allocations are actually necessary to
spin up a temp MR. The basic issue I'd see is that the preallocation
would be done without knowledge of the desired SG list, but maybe some
kind of canonical 'max' SG could be used as a stand-in...
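
A minimal sketch of what such a usage hint could look like, purely as
an assumption (neither the struct nor the pool it would feed exists
today):

    /* Hypothetical hint a ULP could attach to a QP or PD so the core
     * can size a pre-allocated MR pool for a canonical 'max' S/G list. */
    struct rdma_rw_hint {
            unsigned int max_rdma_reads;    /* e.g. up to 16 outstanding READs */
            unsigned int max_transfer;      /* e.g. up to 1MB per RDMA */
    };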

Put together, it would look like this:
    if (rdma_sqes_post_read(...) < avail_sqe)
            avail_sqe -= rdma_post_read(qp, ..., read_wrid);

    [.. fetch wcs ...]

    if (wc.wrid == read_wrid) {
            if (rdma_sqes_post_complete_read(..., read_wrid) < avail_sqe)
                    rdma_post_complete_read(qp, ..., read_wrid);
            else
                    // queue read_wrid for later rdma_post_complete_read
    }

I'm not really seeing anything here that screams out this is
impossible, or performance is impacted, or it is too messy on either
the ULP or driver side.

I think it is possible (at the moment). But I don't know if we should
have the drivers abusing the send/completion queues like that.

I can't say I'm fully on board with the idea of silent send-queue
posting and silent completion consuming.


Laid out like this, I think it even means we can nuke the IB DMA API
for these cases. rdma_post_read and rdma_post_complete_read are the
two points that need DMA API calls (cache flushes), and they can just
do them internally.
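
For instance, a minimal sketch of how rdma_post_read() could hide the
mapping internally; the function and its parameters are hypothetical
assumptions, only ib_dma_map_sg() is an existing call:

    /* Hypothetical: map the S/G list inside rdma_post_read() so the
     * ULP never touches the IB DMA API directly. */
    static int rdma_post_read(struct ib_qp *qp, struct scatterlist *sg,
                              int nents, u64 remote_addr, u32 rkey,
                              u64 read_wrid)
    {
            if (!ib_dma_map_sg(qp->device, sg, nents, DMA_FROM_DEVICE))
                    return -ENOMEM;

            /* ... build and post the RDMA READ WR(s), or a temp MR plus
             * the READ, depending on the device; the real helper would
             * return the number of SQEs it consumed ... */
            return 0;
    }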

This also tells me that the above call sites must already exist in
every ULP, so we, again, are not really substantially changing
core control flow for the ULP.

Are there more details that wreck the world?

Just to break it down:
   - rdma_sqes_post_read figures out how many SQEs are needed to post
     the specified RDMA READ.
       On IB, if the SG list can be used then this is always 1.
       If the RDMA READ is split into N RDMA READs then it is N.
       For something like iWARP this would be (?):
         * FRMR SQE
         * RDMA READ SQE
         * FRMR Invalidate (signaled)

       Presumably we can squeeze FMR and so forth into this scheme as
       well? They don't seem to use SQEs so it looks simpler..

       Perhaps if an internal MR pool is exhausted this returns 0xFFFF
       and the caller will do a completion cycle, which may provide
       free MRs back to the pool. Ultimately, once the SQ and CQ are
       totally drained, the pool should be back to 100%?
   - rdma_post_read generates the necessary number of posts.
     The SQ must have the right number of entries available
     (see above).
   - rdma_post_complete_read does any cleanup posts needed to make an MR
     ready to go again. Perhaps this isn't even posting?

     Semantically, I'd want to see rdma_post_complete_read returning to
     mean that the local read buffer is ready to go, and the ULP can
     start using it instantly. All invalidation is complete and all
     CPU caches are sync'd.

     This is where we'd start the recycling process for any temp MR,
     whatever that means..
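
As one concrete data point for the simplest case, purely as a sketch
(the helper name is hypothetical): on a device that used the plain S/G
list and no temp MR, rdma_post_complete_read would have nothing to
invalidate and would not post at all; it would just undo the DMA
mapping, which also covers the CPU cache sync:

    /* Hypothetical no-temp-MR variant: "complete" is just unmapping. */
    static int sge_post_complete_read(struct ib_qp *qp,
                                      struct scatterlist *sg, int nents)
    {
            ib_dma_unmap_sg(qp->device, sg, nents, DMA_FROM_DEVICE);
            return 0;       /* no SQEs consumed */
    }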

I expect all these calls would be function pointers, and each driver
would provide a function pointer that is optimal for its use. E.g. mlx4
would provide a pointer that uses the S/G list, then falls back to
FRMR if the S/G list is exhausted. The core code would provide a
toolbox of common functions the drivers can use here.
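
A rough sketch of that shape, purely illustrative (the ops struct and
its fields are assumptions, not an existing interface):

    /* Hypothetical per-device ops the core would dispatch through. */
    struct rdma_read_ops {
            unsigned int (*sqes_post_read)(struct ib_qp *qp,
                                           unsigned int nents);
            int (*post_read)(struct ib_qp *qp, struct scatterlist *sg,
                             int nents, u64 remote_addr, u32 rkey,
                             u64 wrid);
            int (*post_complete_read)(struct ib_qp *qp,
                                      struct scatterlist *sg, int nents,
                                      u64 wrid);
    };

    /* mlx4 might wire these to S/G-list variants that fall back to the
     * core's common FRMR helpers when the S/G list is exhausted. */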

Maybe it's just me, but I can't help but wonder if this is facilitating
an atmosphere where drivers will keep finding new ways to abuse even
the simplest operations.

I need more time to comprehend.


I didn't explore how errors work, but I think errors are just a
labeling exercise:
   if (wc is error && wc.wrid == read_wrid)
      rdma_error_complete_read(..., read_wrid, wc)

Error recovery blows up the QP, so we just need to do the bookkeeping
and get the MRs accounted for. The driver could do a synchronous cleanup
of whatever mess is left during the next create_qp, or on the PD destroy.

I know that these are implementation details, but the point is that
vendor drivers can easily become a complete mess. I think we should
try to find a balanced approach where both consumers and providers are
not completely messed up.

Sure, but today vendor drivers and the core are trivial while ULPs are
an absolute mess.

Goal #1 should be to move all the mess into the API and support all
the existing methods. We should try as hard as possible to do that,
and if, along the way, it just isn't possible, then fine. But that
should be the first thing we try to reach for.

Just tidying FRMR so it unifies with indirect is a fine consolation
prize, but I believe we can do better.

To your point in another message, I'd say that as long as the new API
supports FRMR at full speed with no performance penalty, we are
good. If the other variants out there take a performance hit, then I
think that is OK. As you say, they are on the way out; we just need to
make sure that ULPs continue to work with FMR under the new API so
legacy cards don't completely break.

My intention is to improve the FRWR API and gradually remove the other
APIs from the kernel (i.e. FMR/FMR_POOL/MW). As I said, I don't think
that striving for an API that implicitly chooses how to register memory
is a good idea.


