Re: [PATCH v1 1/4] svcrdma: Allocate new transports on device's NUMA node

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Fri, 9 Jun 2023 21:07:21 +0000

> On Jun 9, 2023, at 4:40 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
> 
> On 6/5/2023 9:11 AM, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> The physical device's NUMA node ID is available when allocating an
>> svc_xprt for an incoming connection. Use that value to ensure the
>> svc_xprt structure is allocated on the NUMA node closest to the
>> device.
> 
> How is "closest" determined from the device's perspective?

That's up to the device driver and network core.

> While I agree
> that allocating DMA-able control structures on the same CPU socket is
> good for signaling latency, it may or may not be the best choice for
> interrupt processing. And, some architectures push the NUMA envelope
> pretty far, such as presenting memory-only NUMA nodes, or highly
> asymmetrical processor cores. How are those accounted for?

My understanding is devices are never attached to such nodes.
Anyway, I don't intend this patch to address that situation.

> Would it make more sense to push this affinity down to the CQ level,
> associating them with selected cores? One obvious strategy might be
> to affinitize the send and receive cq's to different cores, and
> carefully place send-side and receive-side contexts on separate
> affinitized cachelines, to avoid pingponging.

That is already done. Each CQ is handled on one core, and both
NFS's and NFSD's transport set up code is careful to allocate
the Receive CQ and Send CQ on different cores from each other
when more than one core is available.

It's up to the device driver and the system administrator to
ensure that the IRQs are affined to cores on the same node as
the device.

> I'm not against the idea, just wondering if it goes far enough. Do
> you have numbers, and if so, on what platforms?

I have a two-node system here, configured such that each node
gets its own nfsd thread pool. The difference in performance
is a change in the latency distribution curve... it's narrower
when memory participating in DMA lives on the same node as the
device, which lowers the average completion latency.

But, for the send and receive ctxts (in subsequent patches),
they are already allocated on the same node where the CQ is
allocated, most of the time.

It's patch 1/4 that ensures that the xprt is also on that node:
that's done because the completion handlers access a bunch of
fields in svcxprt_rdma.

I think the main purpose of these patches is documentary; but
I do see a small uptick in performance with 1/4.

> Tom.
> 
>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   18 +++++++++---------
>>  1 file changed, 9 insertions(+), 9 deletions(-)
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index ca04f7a6a085..2abd895046ee 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -64,7 +64,7 @@
>>  #define RPCDBG_FACILITY RPCDBG_SVCXPRT
>>    static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
>> - struct net *net);
>> + struct net *net, int node);
>>  static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
>>   struct net *net,
>>   struct sockaddr *sa, int salen,
>> @@ -123,14 +123,14 @@ static void qp_event_handler(struct ib_event *event, void *context)
>>  }
>>    static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
>> - struct net *net)
>> + struct net *net, int node)
>>  {
>> - struct svcxprt_rdma *cma_xprt = kzalloc(sizeof *cma_xprt, GFP_KERNEL);
>> + struct svcxprt_rdma *cma_xprt;
>>  - if (!cma_xprt) {
>> - dprintk("svcrdma: failed to create new transport\n");
>> + cma_xprt = kzalloc_node(sizeof(*cma_xprt), GFP_KERNEL, node);
>> + if (!cma_xprt)
>>   return NULL;
>> - }
>> +
>>   svc_xprt_init(net, &svc_rdma_class, &cma_xprt->sc_xprt, serv);
>>   INIT_LIST_HEAD(&cma_xprt->sc_accept_q);
>>   INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
>> @@ -193,9 +193,9 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>   struct svcxprt_rdma *newxprt;
>>   struct sockaddr *sa;
>>  - /* Create a new transport */
>>   newxprt = svc_rdma_create_xprt(listen_xprt->sc_xprt.xpt_server,
>> -       listen_xprt->sc_xprt.xpt_net);
>> +       listen_xprt->sc_xprt.xpt_net,
>> +       ibdev_to_node(new_cma_id->device));
>>   if (!newxprt)
>>   return;
>>   newxprt->sc_cm_id = new_cma_id;
>> @@ -304,7 +304,7 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
>>     if (sa->sa_family != AF_INET && sa->sa_family != AF_INET6)
>>   return ERR_PTR(-EAFNOSUPPORT);
>> - cma_xprt = svc_rdma_create_xprt(serv, net);
>> + cma_xprt = svc_rdma_create_xprt(serv, net, NUMA_NO_NODE);
>>   if (!cma_xprt)
>>   return ERR_PTR(-ENOMEM);
>>   set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);

--
Chuck Lever