Re: [PATCH 0/5] Indirect memory registration feature

On Jun 9, 2015, at 4:44 AM, Sagi Grimberg <sagig@xxxxxxxxxxxxxxxxxx> wrote:

> On 6/9/2015 9:20 AM, Christoph Hellwig wrote:
>> On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
>>> I wouldn't say this is about offloading bounce buffering to silicon.
>>> The RDMA stack always imposed the alignment limitation as we can only
>>> give a page lists to the devices. Other drivers (qlogic/emulex FC
>>> drivers for example), use an _arbitrary_ SG lists where each element can
>>> point to any {addr, len}.
>> 
>> Those are drivers for protocols that support real SG lists.   It seems
>> only Infiniband and NVMe expose this silly limit.
> 
> I agree this is indeed a limitation and that's why SG_GAPS was added
> in the first place. I think the next gen of nvme devices will support real SG lists. This feature enables existing Infiniband devices that can handle SG lists to receive them via the RDMA stack (ib_core).
> 
> If the memory registration process wasn't such a big fiasco in the
> first place, wouldn't this approach make the most sense?
> 
>> 
>>>> So please fix it in the proper layers
>>>> first,
>>> 
>>> I agree that we can take care of bounce buffering in the block layer
>>> (or scsi for SG_IO) if the driver doesn't want to see any type of
>>> unaligned SG lists.
>>> 
>>> But do you think that it should come before the stack can support this?
>> 
>> Yes, absolutely.  The other thing that needs to come first is a proper
>> abstraction for MRs instead of hacking another type into all drivers.
>> 
> 
> I'm very much open to the idea of consolidating the memory registration
> code instead of doing it in every ULP (srp, iser, xprtrdma, svcrdma,
> rds, more to come...) using a general memory registration API. The main
> challenge is to abstract the different methods (and considerations) of
> memory registration behind an API. Do we completely mask out the way we
> are doing it? I'm worried that we might end up either compromising on
> performance or trying to understand too much what the caller is trying
> to achieve.

The point of an API like this is to flatten the developer’s learning
curve at the cost of adding another layer of abstraction. For
in-kernel storage ULPs, I’m not convinced that’s a good trade-off.

The other major issue is dealing with multiple registration methods.
Having new ULPs stick with just one or two seems like it could take
care of that without fuss. HCA vendors seem to be settling on FRMR.

But FRMR has some limitations, some of which I discuss below. IMO it
would be better to address FRMR's existing limitations before building
a shim over it. That could help with both the learning curve and the
complexity.

On with the pre-caffeine musings:

> For example:
> - frwr requires a queue-pair for the post (and it must be the ULP
>  queue-pair to ensure the registration is done before the data-transfer
>  begins).

The QP does guarantee ordering between registration and use in an RDMA
transfer, but it comes at a cost.

And not just any QP is required, but a QP in RTS (i.e., connected).
Without a connected QP, AIUI you can't invalidate registered FRMRs;
you have to destroy them.

So, for RPC, if the server is down and there are pending RPCs, any FRMRs
associated with that transport are pinned (can’t be re-used or invalidated)
until the server comes back. If an RPC is killed or times out while the
server is down, the associated FRMRs are in limbo.

Registration via WR also means the ULP has to handle the registration and
invalidation completions (or decide to leave them unsignaled, but then
it has to worry about send queue wrapping).

All transport connect operations must be serialized with posting to ensure
that ib_post_send() has a valid (not NULL) QP and ID handle. And, if a QP
or ID handle is used during completion handling, you have to be careful
there too.

(Not to say any of this is impossible to deal with. Obviously RPC/RDMA
works today.)

> While fmrs does not need the queue-pair.

> - the ULP would probably always initiate data transfer after the
>  registration (send a request or do the rdma r/w). It is useful to
>  link the frwr post with the next wr in a single post_send call.
>  I wonder how an API would allow such a thing (while other registration
>  methods don't use work request interface).

rkey management is also important. Registration is done before use so
the ULP can send the rkey for that FRMR to the remote to initiate the
remote DMA operation.

However, after a transport disconnect, the rkey in the FRMR may not be
the same one the hardware knows about. Recovery in this case means the
FRMR has to be destroyed. Invalidating the FRMR also requires the ULP
to know the hardware rkey, so it doesn’t work in this case.

What all this means is that significant extra complexity is required to
deal with transport disconnection, which is very rare compared to normal
data transfer operations.

FMR, for instance, has much less of an issue here because the map and
unmap verbs are synchronous, do not rely on having a QP, and do not
generate send completions. But FMR has other problems.

> - There is the fmr_pool API which tries to tackle the disadvantages of
>  fmrs (very slow unmap) by delaying the fmr unmap until some dirty
>  watermark of remapping is met. I'm not sure how this can be done.

I wonder if FMR should be considered at all for a simplified API. Sure,
there are some older cards that do not support FRMR, but it seems like
FMR will be abandoned sooner or later.

> - How would the API choose the method to register memory?

Based on the capabilities of the HCA, I would think.

> - If there is an alignment issue, do we fail? do we bounce?
> 
> - There is the whole T10-DIF support…

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


