Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

On Thu, Jul 09, 2015 at 02:02:03PM +0300, Sagi Grimberg wrote:

> We have protocol that involves remote memory keys transfer in their
> standards so I don't see how we can remove it altogether from ULPs.

This is why I've been talking about local and remote MRs
differently. A Local MR is one where the Key is never put on the wire;
it exists solely to facilitate DMA between the CPU and the local HCA,
and it would never be needed if we had infinitely long S/G lists.

> My main problem with this approach is that once you do non-trivial
> things such as memory registration completely under the hood, it is
> a slippery slope for device drivers.

Yes, there is going to be some stuff going on, but the simplification
for the ULP side is incredible; it is certainly something that should
be explored and not dismissed without some really good reasons.

> If say a driver decides to register memory without the caller knowing,
> it would need to post an extra work request on the send queue.

Yes, the first issue is how to flow control the sendq.

But this seems easily solved: every ULP already tracks the number of
available entries in the sendq, and it will not start new ops until
there is space. So instead of the ULP internally computing how much
space is needed to do X, we factor that computation out:

   if (rdma_sqes_post_read(...) < avail_sqe)
     avail_sqe -= rdma_post_read(...)
   else
     // Try again after completions advance

Every new-style post op is paired with a 'how many entries do I need'
call.

This is not a new concept: a ULP working with FRMR already has to
know it cannot start an FRMR-using op unless there are 2 SQEs
available (and it has to make all of this conditional on whether it
is using FRMR or something else). All this does is shift the
computation of '2' out of the ULP and into the driver.

> So once it sees the completion, it needs to silently consume it and
> have some non trivial logic to invalidate it (another work request!)
> either from poll_cq context or another thread.

Completions are driven by the ULP. Every new-style post also has a
completion entry point. The ULP calls it when it knows the op is
done, either because the WRID it provided has signaled completion,
or because a later op has completed (signaling was suppressed).

Since that may also be an API that posts implicitly (if that is even
necessary, is it?), it follows the same rules as above. This isn't
changing anything: ULPs already have to drive invalidate posts from
completion handling with flow control; we are just moving the actual
post construction and the computation of the needed SQEs out of the
ULP.

> This would also require the drivers to take a huristic approach on how
> much memory registration resources are needed for all possible
> consumers (ipoib, sdp, srp, iser, nfs, more...) which might have
> different requirements.

That doesn't seem like a big issue. The ULP can give a hint on the PD
or QP about what sort of usage it expects ('up to 16 RDMA READs', 'up
to 1MB of transfer per RDMA') and the core can use a pre-allocated
pool scheme.
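
Something along these lines (struct and field names invented purely
for illustration) could carry that hint at PD/QP creation time:

   /* Hypothetical hint a ULP could attach when creating the PD/QP so
    * the core can size an internal MR pool; not an existing API. */
   struct rdma_mr_pool_hint {
           unsigned int max_outstanding_rdma;   /* e.g. 16 RDMA READs */
           unsigned int max_transfer_bytes;     /* e.g. 1 << 20 (1MB) */
   };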

I was thinking about pre-allocation for the local case here, as
Christoph suggests. I think that is a refinement we could certainly
add on once there is some clear idea what allocations are actually
necessary to spin up a temp MR. The basic issue I'd see is that the
preallocation would be done without knowledge of the desired SG
list, but maybe some kind of canonical 'max' SG could be used as a
stand-in...
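
One speculative way to use such a 'max' stand-in, reusing the
hypothetical hint struct sketched above (DIV_ROUND_UP and PAGE_SIZE
are the usual kernel macros, the rest is invented):

   /* Speculative sizing: turn the hint into a pool of temp MRs, each
    * able to map a canonical 'max' S/G list for the hinted transfer. */
   static void example_size_mr_pool(const struct rdma_mr_pool_hint *hint,
                                    unsigned int *nr_mrs,
                                    unsigned int *pages_per_mr)
   {
           *nr_mrs       = hint->max_outstanding_rdma;
           *pages_per_mr = DIV_ROUND_UP(hint->max_transfer_bytes,
                                        PAGE_SIZE);
   }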

Put together, it would look like this:
   if (rdma_sqes_post_read(...) < avail_sqe)
        avail_sqe -= rdma_post_read(qp, ..., read_wrid)
   [.. fetch wcs ...]
   if (wc.wrid == read_wrid)
        if (rdma_sqes_post_complete_read(..., read_wrid) < avail_sqe)
             rdma_post_complete_read(qp, ..., read_wrid);
        else
             // queue read_wrid for later rdma_post_complete_read

I'm not really seeing anything here that screams out that this is
impossible, that performance would suffer, or that it is too messy on
either the ULP or driver side.

Laid out like this, I think it even means we can nuke the IB DMA API
for these cases. rdma_post_read and rdma_post_complete_read are the
two points that need DMA API calls (cache flushes), and they can just
do them internally.
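
A sketch of that idea (only the ib_dma_* wrapper is existing API, the
rest follows the proposed names and is illustrative; it returns the
number of SQEs consumed to match the 'avail_sqe -=' usage above):

   static int rdma_post_read(struct ib_qp *qp, struct scatterlist *sg,
                             int nents, u64 remote_addr, u32 rkey,
                             u64 wrid)
   {
           int nposted = 0;

           /* cache flush / IOMMU mapping moves inside the post call */
           if (!ib_dma_map_sg(qp->device, sg, nents, DMA_FROM_DEVICE))
                   return -ENOMEM;

           /* ... build the RDMA READ (or FRMR + READ) WRs against
            * remote_addr/rkey, tag the last WR with wrid,
            * ib_post_send() them and count them in nposted ... */

           return nposted;
   }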

This also tells me that the above call sites must already exist in
every ULP, so we, again, are not really substantially changing
core control flow for the ULP.

Are there more details that wreck the world?

Just to break it down:
  - rdma_sqes_post_read figures out how many SQEs are needed to post
    the specified RDMA READ.
      On IB, if the SG list can be used then this is always 1.
      If the RDMA READ is split into N RDMA READs then it is N.
      For something like iWARP this would be (?):
        * FRMR SQE
        * RDMA READ SQE
        * FRMR Invalidate (signaled)

      Presumably we can squeeze FMR and so forth into this scheme as
      well? They don't seem to use SQEs, so it looks simpler..

      Perhaps if an internal MR pool is exhausted this returns 0xFFFF
      and the caller will do a completion cycle, which may provide
      free MRs back to the pool. Ultimately, once the SQ and CQ are
      totally drained, the pool should be back to 100%?
  - rdma_post_read generates the necessary number of posts.
    The SQ must have the right number of entries available
    (see above).
  - rdma_post_complete_read does any cleanup posts needed to make an MR
    ready to go again. Perhaps this isn't even posting?

    Semantically, I'd want to see rdma_post_complete_read returning to
    mean that the local read buffer is ready to go, and the ULP can
    start using it instantly: all invalidation is complete and all
    CPU caches are sync'd (a sketch of this follows the list).

    This is where we'd start the recycling process for any temp MR,
    whatever that means..
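
A sketch of that completion side (same caveat: only the ib_dma_*
wrapper is existing API, everything else is just the proposed shape):

   static int rdma_post_complete_read(struct ib_qp *qp,
                                      struct scatterlist *sg, int nents,
                                      u64 wrid)
   {
           /* ... issue or queue whatever invalidate is needed to
            * retire the temp MR tied to wrid and recycle it ... */

           /* undo the mapping done in rdma_post_read(); after this
            * the CPU can touch the buffer directly */
           ib_dma_unmap_sg(qp->device, sg, nents, DMA_FROM_DEVICE);
           return 0;
   }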

I expect all these calls would be function pointers, and each driver
would provide the function pointer that is optimal for its use. E.g.
mlx4 would provide a pointer that uses the S/G list and then falls
back to FRMR if the S/G list is exhausted. The core code would
provide a toolbox of common functions the drivers can use here.
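
One possible shape for those hooks (speculative, the struct is not an
existing interface):

   /* Per-device hook table the core could fill with generic S/G,
    * FRMR or FMR based implementations; a driver like mlx4 would
    * override them with whatever is optimal for its hardware. */
   struct rdma_read_ops {
           unsigned int (*sqes_post_read)(struct ib_qp *qp,
                                          struct scatterlist *sg,
                                          int nents);
           int (*post_read)(struct ib_qp *qp, struct scatterlist *sg,
                            int nents, u64 remote_addr, u32 rkey,
                            u64 wrid);
           int (*post_complete_read)(struct ib_qp *qp,
                                     struct scatterlist *sg, int nents,
                                     u64 wrid);
   };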

I didn't explore how errors work, but I think errors are just a
labeling exercise:
  if (wc is error && wc.wrid == read_wrid)
     rdma_error_complete_read(..., read_wrid, wc)

Error recovery blows up the QP, so we just need to do the bookkeeping
and get the MRs accounted for. The driver could do a synchronous
cleanup of whatever mess is left during the next create_qp, or on PD
destroy.

> I know that these are implementation details, but the point is that
> vendor drivers can easily become a complete mess. I think we should
> try to find a balanced approach where both consumers and providers are
> not completely messed up.

Sure, but today the vendor drivers and the core are trivial while the
ULPs are an absolute mess.

Goal #1 should be to move all the mess into the API and support all
the existing methods. We should try as hard as possible to do that,
and if, along the way, it turns out that just isn't possible, then
fine. But that should be the first thing we reach for.

Just tidying FRMR so it unifies with indirect is a fine consolation
prize, but I believe we can do better.

To your point in another message, I'd say that as long as the new API
supports FRMR at full speed with no performance penalty, we are
good. If the other variants out there take a performance hit, then I
think that is OK. As you say, they are on the way out; we just need
to make sure that ULPs continue to work with FMR under the new API so
legacy cards don't completely break.

Jason


