> On Mar 22, 2017, at 11:41 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>> On Mar 22, 2017, at 10:17 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>>
>>> The plan is to replace the local bespoke code that constructs and
>>> posts RDMA Read and Write Work Requests with calls to the rdma_rw
>>> API. This shares with other RDMA-enabled ULPs the code that manages
>>> the gory details of buffer registration and posting Work Requests.
>>>
>>> Some design notes:
>>>
>>> o svc_xprt reference counting is modified, since one rdma_rw_ctx
>>>   generates one completion, no matter how many Write WRs are
>>>   posted. To accommodate the new reference counting scheme, a new
>>>   version of svc_rdma_send() is introduced.
>>>
>>> o The structure of RPC-over-RDMA transport headers is flexible,
>>>   allowing multiple segments per Reply with arbitrary alignment.
>>>   Thus I did not take the further step of chaining Write WRs with
>>>   the Send WR containing the RPC Reply message. The Write and Send
>>>   WRs continue to be built by separate pieces of code.
>>>
>>> o The current code builds the transport header as it is
>>>   constructing Write WRs. I've replaced that with marshaling of
>>>   transport header data items in a separate step. This is because
>>>   the exact structure of client-provided segments may not align
>>>   with the components of the server's reply xdr_buf, or the pages
>>>   in the page list. Thus parts of each client-provided segment may
>>>   be written at different points in the send path.
>>>
>>> o Since the Write list and Reply chunk marshaling code is being
>>>   replaced, I took the opportunity to replace some of the C
>>>   structure-based XDR encoding code with more portable code that
>>>   instead uses pointer arithmetic.
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>>> new file mode 100644
>>> index 0000000..1e76227
>>> --- /dev/null
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>>> @@ -0,0 +1,785 @@
>>> +/*
>>> + * Copyright (c) 2016 Oracle. All rights reserved.
>>> + *
>>> + * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
>>> + */
>>> +
>>> +#include <linux/sunrpc/rpc_rdma.h>
>>> +#include <linux/sunrpc/svc_rdma.h>
>>> +#include <linux/sunrpc/debug.h>
>>> +
>>> +#include <rdma/rw.h>
>>> +
>>> +#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
>>> +
>>> +/* Each R/W context contains state for one chain of RDMA Read or
>>> + * Write Work Requests (one RDMA segment to be read from or written
>>> + * back to the client).
>>> + *
>>> + * Each WR chain handles a single contiguous server-side buffer,
>>> + * because some registration modes (e.g. FRWR) do not support a
>>> + * discontiguous scatterlist.
>>> + *
>>> + * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
>>> + * from a client may contain a unique R_key, so each WR chain moves
>>> + * one segment (or less) at a time.
>>> + *
>>> + * The scatterlist makes this data structure just over 8KB in size
>>> + * on 4KB-page platforms. As the size of this structure increases
>>> + * past one page, it becomes more likely that allocating one of these
>>> + * will fail. Therefore, these contexts are created on demand, but
>>> + * cached and reused until the controlling svcxprt_rdma is destroyed.
>>> + */
>>> +struct svc_rdma_rw_ctxt {
>>> +	struct list_head	rw_list;
>>> +	struct ib_cqe		rw_cqe;
>>> +	struct svcxprt_rdma	*rw_rdma;
>>> +	int			rw_nents;
>>> +	int			rw_wrcount;
>>> +	enum dma_data_direction rw_dir;
>>> +	struct svc_rdma_op_ctxt *rw_readctxt;
>>> +	struct rdma_rw_ctx	rw_ctx;
>>> +	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];
>>
>> Have you considered using sg_table with sg_alloc_table_chained?
>>
>> See lib/sg_pool.c and nvme-rdma as a consumer.
>
> That might be newer than my patches. I'll have a look.

I looked at the consumers of sg_alloc_table_chained, and all of these
callers immediately do ib_dma_map_sg. Nothing in svc_rdma_rw.c does a
DMA map; it relies on rdma_rw_ctx_init for that, and that API is
passed a scatterlist.

I don't see how I could use sg_alloc_table_chained here, unless
rdma_rw_ctx_init were modified to take a chained sg_table instead of
a scatterlist argument.

I suppose I could convert the client side to use it? What do you think?

>>> +};
>>> +
>>> +static struct svc_rdma_rw_ctxt *
>>> +svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
>>> +{
>>> +	struct svc_rdma_rw_ctxt *ctxt;
>>> +
>>> +	svc_xprt_get(&rdma->sc_xprt);
>>> +
>>> +	spin_lock(&rdma->sc_rw_ctxt_lock);
>>> +	if (list_empty(&rdma->sc_rw_ctxts))
>>> +		goto out_empty;
>>> +
>>> +	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
>>> +				struct svc_rdma_rw_ctxt, rw_list);
>>> +	list_del_init(&ctxt->rw_list);
>>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>>> +
>>> +out:
>>> +	ctxt->rw_dir = DMA_NONE;
>>> +	return ctxt;
>>> +
>>> +out_empty:
>>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>>> +
>>> +	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
>>> +	if (!ctxt) {
>>> +		svc_xprt_put(&rdma->sc_xprt);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	ctxt->rw_rdma = rdma;
>>> +	INIT_LIST_HEAD(&ctxt->rw_list);
>>> +	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
>>> +	goto out;
>>> +}
>>> +
>>> +static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
>>> +{
>>> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
>>> +
>>> +	if (ctxt->rw_dir != DMA_NONE)
>>> +		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
>>> +				    rdma->sc_port_num,
>>> +				    ctxt->rw_sg, ctxt->rw_nents,
>>> +				    ctxt->rw_dir);
>>> +
>>
>> It's a bit odd to see a put_rw_ctxt that also destroys the context,
>> which doesn't exactly pair with get_rw_ctxt.
>>
>> Maybe it'd be useful to explicitly do that outside the put.
>
> The pairing is not obvious, but it is this:
>
>   svc_rdma_send_writes() does the svc_rdma_get_rw_ctx.
>
>   -> If posting succeeds, svc_rdma_wc_write_ctx puts the ctx.
>
>   -> If posting fails, svc_rdma_send_writes puts the ctx.
>
> Do you have a suggestion about how this could be more
> intuitively documented?
>
> IIRC I combined these because the rdma_rw_ctx_destroy is
> always done just before putting the ctx back on the free
> list. It eliminates some code duplication, and ensures
> the ctx is always ready for the next svc_rdma_get_rw_ctx.

I fixed this up; I think it is an improvement. Thanks for the
suggestion.

--
Chuck Lever
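
[Editor's note] The design notes above mention replacing C
structure-based XDR encoding with more portable pointer arithmetic. A
hypothetical illustration of that style follows; the function name and
layout are invented for this sketch, not taken from the patch. It
encodes one RPC-over-RDMA segment (R_key, length, 64-bit offset) by
advancing a __be32 pointer:

	#include <linux/sunrpc/xdr.h>

	/* Sketch only: encode one RDMA segment without casting the
	 * buffer onto a C struct. Returns the advanced write pointer.
	 */
	static __be32 *xdr_encode_rdma_segment(__be32 *p, u32 handle,
					       u32 length, u64 offset)
	{
		*p++ = cpu_to_be32(handle);	/* RDMA R_key */
		*p++ = cpu_to_be32(length);	/* segment length in octets */
		p = xdr_encode_hyper(p, offset);	/* 64-bit remote offset */
		return p;
	}

Because the caller keeps only a pointer cursor, the same helper works
no matter how client-provided segments line up against the reply's
xdr_buf components.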
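[Editor's note] Sagi's sg_alloc_table_chained suggestion would look
roughly like the sketch below. It assumes the three-argument
sg_alloc_table_chained()/sg_free_table_chained() API that lib/sg_pool.c
provided at the time, and assumes rdma_rw_ctx_init() can simply be
handed the table's .sgl pointer, which is an ordinary scatterlist:

	#include <linux/scatterlist.h>

	/* Sketch only: carry a chained sg_table in the ctxt instead of
	 * a fixed rw_sg[RPCSVC_MAXPAGES] array. Transfers needing no
	 * more than SG_CHUNK_SIZE entries use the inline first chunk;
	 * larger ones chain extra scatterlist pages on demand.
	 */
	struct svc_rdma_rw_ctxt_sketch {
		struct list_head	rw_list;
		struct rdma_rw_ctx	rw_ctx;
		int			rw_nents;
		struct sg_table		rw_sg_table;
		struct scatterlist	rw_first_sgl[SG_CHUNK_SIZE];
	};

	static bool svc_rdma_rw_alloc_sgl(struct svc_rdma_rw_ctxt_sketch *ctxt,
					  unsigned int sges)
	{
		if (sg_alloc_table_chained(&ctxt->rw_sg_table, sges,
					   ctxt->rw_first_sgl))
			return false;
		ctxt->rw_nents = sges;
		return true;	/* pass ctxt->rw_sg_table.sgl to rdma_rw_ctx_init */
	}

	static void svc_rdma_rw_free_sgl(struct svc_rdma_rw_ctxt_sketch *ctxt)
	{
		sg_free_table_chained(&ctxt->rw_sg_table, true);
	}

This would shrink the common-case context well below the "just over
8KB" footprint noted in the patch comment, while still handling
worst-case segment counts.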
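[Editor's note] The ownership rule Chuck describes (get in
svc_rdma_send_writes(); put either in the Write completion handler on
success, or in the poster's error path on failure) condenses to the
sketch below. This is not the patch itself; svc_rdma_post_write_chain()
is an invented stand-in for the real posting logic:

	/* Completion path: posting succeeded, so the chain owns the
	 * ctxt, and the put (which also calls rdma_rw_ctx_destroy)
	 * happens here.
	 */
	static void svc_rdma_wc_write_ctx(struct ib_cq *cq, struct ib_wc *wc)
	{
		struct svc_rdma_rw_ctxt *ctxt =
			container_of(wc->wr_cqe, struct svc_rdma_rw_ctxt,
				     rw_cqe);

		svc_rdma_put_rw_ctxt(ctxt);
	}

	static int svc_rdma_send_writes(struct svcxprt_rdma *rdma)
	{
		struct svc_rdma_rw_ctxt *ctxt;
		int ret;

		ctxt = svc_rdma_get_rw_ctxt(rdma);	/* takes svc_xprt ref */
		if (!ctxt)
			return -ENOMEM;

		ret = svc_rdma_post_write_chain(rdma, ctxt);	/* hypothetical */
		if (ret) {
			/* Ownership never transferred; the poster puts
			 * the ctxt. rw_dir is still DMA_NONE unless
			 * rdma_rw_ctx_init ran, so the put performs the
			 * right amount of cleanup either way.
			 */
			svc_rdma_put_rw_ctxt(ctxt);
			return ret;
		}

		/* On success, svc_rdma_wc_write_ctx() puts the ctxt. */
		return 0;
	}

Folding rdma_rw_ctx_destroy into the put is what guarantees a ctxt
pulled off the free list is always ready for the next get.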