Re: [PATCH 09/15] sunrpc: Report per-RPC execution stats

> On Mar 14, 2018, at 9:11 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>> 
>> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>>> Introduce a mechanism to report the server-side execution latency of
>>> each RPC. The goal is to enable user space to filter the trace
>>> record for latency outliers, build histograms, etc.
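[As a rough sketch of that goal: the event's output format introduced by
this patch ("addr=%s xid=0x%08x execute=%lu") is simple enough that user
space can filter for outliers in a few lines of Python. The sample lines
and the threshold below are made up purely for illustration.]

```python
import re

# Match the svc_stats_latency output format from this patch:
#   "addr=%s xid=0x%08x execute=%lu"   (execute is in microseconds)
RECORD = re.compile(r"addr=(\S+) xid=0x([0-9a-f]{8}) execute=(\d+)")

def latencies(lines):
    """Yield (addr, xid, execute_us) for each svc_stats_latency record."""
    for line in lines:
        m = RECORD.search(line)
        if m:
            yield m.group(1), int(m.group(2), 16), int(m.group(3))

def outliers(lines, threshold_us):
    """Return only the records whose execution time exceeds threshold_us."""
    return [r for r in latencies(lines) if r[2] > threshold_us]
```

[Feeding e.g. the output of "trace-cmd report" through outliers(..., 10000)
would keep only the RPCs that took longer than 10 ms to execute.]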
>> 
>> Neato.
>> 
>> To be useful I'd think you'd want more information about each RPC.
>> (E.g. you'd like to know that the latency outliers all did reads).  I
>> guess you could use the address and xid to cross-reference with
>> information collected from somewhere else?
> 
> Yes. You can enable other trace points (like the nfsd_compound ones)
> to see what class each operation is in.
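[To illustrate the cross-referencing idea: a post-processing script could
join the two events on XID. The svc_stats_latency format below comes from
this patch; the nfsd_compound line format ("xid=0x<xid> <opname>") is only
an assumption for the sake of the sketch.]

```python
import re

# svc_stats_latency format is from this patch; the nfsd_compound format
# assumed here is illustrative only.
LATENCY = re.compile(r"svc_stats_latency: .*xid=0x([0-9a-f]+) execute=(\d+)")
COMPOUND = re.compile(r"nfsd_compound: .*xid=0x([0-9a-f]+) (\S+)")

def classify(lines):
    """Join the two events on XID: returns xid -> (operation, execute_us)."""
    ops = {}
    joined = {}
    for line in lines:
        m = COMPOUND.search(line)
        if m:
            ops[int(m.group(1), 16)] = m.group(2)
            continue
        m = LATENCY.search(line)
        if m:
            xid = int(m.group(1), 16)
            joined[xid] = (ops.get(xid, "unknown"), int(m.group(2)))
    return joined
```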
> 
> And yes, I would like to have all the relevant information for each
> RPC in a single trace record; I just haven't figured out a way to
> extract it as nicely as I did it on the client (patch forthcoming).
> On the client side, a table is set up for each RPC program that
> contains a mapping between RPC procedure names and numbers. I was not
> able to find a similar convenience on the server side.
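[In the meantime, nothing stops user-space tooling from carrying such a
table itself, since the procedure numbers are fixed by the protocol specs
(the NFSv3 numbers below are from RFC 1813; the helper and the lookup-table
layout are just an illustration).]

```python
# NFSv3 procedure-number-to-name mapping, per RFC 1813.
NFS3_PROCS = {
    0: "NULL", 1: "GETATTR", 2: "SETATTR", 3: "LOOKUP", 4: "ACCESS",
    5: "READLINK", 6: "READ", 7: "WRITE", 8: "CREATE", 9: "MKDIR",
    10: "SYMLINK", 11: "MKNOD", 12: "REMOVE", 13: "RMDIR", 14: "RENAME",
    15: "LINK", 16: "READDIR", 17: "READDIRPLUS", 18: "FSSTAT",
    19: "FSINFO", 20: "PATHCONF", 21: "COMMIT",
}

# One table per RPC program number (100003 is the NFS program).
PROC_TABLES = {100003: NFS3_PROCS}

def proc_name(prog, proc):
    """Translate (program, procedure) to a name, falling back to the number."""
    return PROC_TABLES.get(prog, {}).get(proc, str(proc))
```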
> 
> 
>> What's our commitment to backwards compatibility?  Trace points seem to
>> be the wild west compared to the rest of the kernel interface, but if
>> we want to encourage tooling on top of this then I guess we'll need to
>> be strict.
> 
> That has been discussed elsewhere (on LWN.net and more recently on
> linux-rdma). The only compatibility constraints apply to trace points
> that have user tools and infrastructure depending on them, such as the
> scheduler trace points used by latencyTOP. The NFS and sunrpc trace
> points do not have this constraint: they are currently consumed only
> by generic tools like trace-cmd. So we are free to innovate for the
> time being.
> 
> 
>> Looking at the tcp case, I think it's measuring from the time
>> tcp_recvfrom receives the last fragment making up an rpc request till
>> the last sendpage() of the reply returns.  Did you consider other spots?
>> (E.g. why after the send instead of before?)
> 
> Yes, I've considered other spots. I don't consider the spots I'm
> proposing here to be written in stone, and I welcome help with placing
> the socket-based timestamp capture points.
> 
> Some sendto implementations are more complex than others. For instance,
> RPC/RDMA can post RDMA Writes containing data content first, then in the
> final step post the RDMA Send carrying the RPC Reply header. The RDMA
> Write step can be considered server-side processing, and thus is part
> of the latency. Or, if we ever decide to move the RDMA Write step into
> the XDR layer, it will definitely be counted as processing latency.
> 
> One thing I would like to keep in the latency measurement is how long
> this rqstp has waited to acquire the send mutex. But otherwise, I'm
> open to other ideas about how to measure this latency.

Hi Bruce-

How about measuring the same way for all transports:

- Capture a timestamp when xpo_recvfrom returns a positive value

- Fire the svc_stats_latency event just before invoking xpo_sendto

Would you be more comfortable with that arrangement?


>> --b.
>> 
>>> 
>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> ---
>>> include/linux/sunrpc/svc.h               |    1 +
>>> include/linux/sunrpc/svc_rdma.h          |    1 +
>>> include/trace/events/sunrpc.h            |   20 ++++++++++++++++++++
>>> net/sunrpc/svc_xprt.c                    |    2 +-
>>> net/sunrpc/svcsock.c                     |    2 ++
>>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |    1 +
>>> net/sunrpc/xprtrdma/svc_rdma_transport.c |    1 +
>>> 7 files changed, 27 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>> index 786ae22..1939709 100644
>>> --- a/include/linux/sunrpc/svc.h
>>> +++ b/include/linux/sunrpc/svc.h
>>> @@ -245,6 +245,7 @@ struct svc_rqst {
>>> 	void *			rq_xprt_ctxt;	/* transport specific context ptr */
>>> 	struct svc_deferred_req*rq_deferred;	/* deferred request we are replaying */
>>> 
>>> +	ktime_t			rq_rtime;	/* receive time */
>>> 	size_t			rq_xprt_hlen;	/* xprt header len */
>>> 	struct xdr_buf		rq_arg;
>>> 	struct xdr_buf		rq_res;
>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>> index 4b731b0..8045981 100644
>>> --- a/include/linux/sunrpc/svc_rdma.h
>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>> @@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
>>> 	struct list_head list;
>>> 	struct xdr_buf arg;
>>> 	struct ib_cqe cqe;
>>> +	ktime_t rtime;
>>> 	u32 byte_len;
>>> 	struct svcxprt_rdma *xprt;
>>> 	enum dma_data_direction direction;
>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>>> index 5a8157c..50de512 100644
>>> --- a/include/trace/events/sunrpc.h
>>> +++ b/include/trace/events/sunrpc.h
>>> @@ -705,6 +705,26 @@
>>> 		__entry->len, show_svc_xprt_flags(__entry->flags))
>>> );
>>> 
>>> +TRACE_EVENT(svc_stats_latency,
>>> +	TP_PROTO(const struct svc_rqst *rqst, ktime_t end),
>>> +
>>> +	TP_ARGS(rqst, end),
>>> +
>>> +	TP_STRUCT__entry(
>>> +		__field(u32, xid)
>>> +		__field(unsigned long, execute)
>>> +		__string(addr, rqst->rq_xprt->xpt_remotebuf)
>>> +	),
>>> +
>>> +	TP_fast_assign(
>>> +		__entry->xid = be32_to_cpu(rqst->rq_xid);
>>> +		__entry->execute = ktime_to_us(ktime_sub(end, rqst->rq_rtime));
>>> +		__assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>>> +	),
>>> +
>>> +	TP_printk("addr=%s xid=0x%08x execute=%lu",
>>> +		__get_str(addr), __entry->xid, __entry->execute)
>>> +);
>>> 
>>> DECLARE_EVENT_CLASS(svc_deferred_event,
>>> 	TP_PROTO(struct svc_deferred_req *dr),
>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>> index f745754..2556e6e 100644
>>> --- a/net/sunrpc/svc_xprt.c
>>> +++ b/net/sunrpc/svc_xprt.c
>>> @@ -782,7 +782,6 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
>>> 			len = svc_deferred_recv(rqstp);
>>> 		else
>>> 			len = xprt->xpt_ops->xpo_recvfrom(rqstp);
>>> -		dprintk("svc: got len=%d\n", len);
>>> 		rqstp->rq_reserved = serv->sv_max_mesg;
>>> 		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>>> 	}
>>> @@ -893,6 +892,7 @@ int svc_send(struct svc_rqst *rqstp)
>>> 		len = -ENOTCONN;
>>> 	else
>>> 		len = xprt->xpt_ops->xpo_sendto(rqstp);
>>> +	trace_svc_stats_latency(rqstp, ktime_get_real());
>>> 	mutex_unlock(&xprt->xpt_mutex);
>>> 	rpc_wake_up(&xprt->xpt_bc_pending);
>>> 	trace_svc_send(rqstp, len);
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 4ca1d92..f616962 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -574,6 +574,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>>> 		/* Don't enable netstamp, sunrpc doesn't
>>> 		   need that much accuracy */
>>> 	}
>>> +	rqstp->rq_rtime = skb->tstamp;
>>> 	svsk->sk_sk->sk_stamp = skb->tstamp;
>>> 	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */
>>> 
>>> @@ -1143,6 +1144,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>> 	if (len < 0)
>>> 		goto error;
>>> 
>>> +	rqstp->rq_rtime = ktime_get_real();
>>> 	svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
>>> 	if (serv->sv_stats)
>>> 		serv->sv_stats->nettcpcnt++;
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> index 19e9c6b..446b9d6 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> @@ -558,6 +558,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
>>> 		goto out_readchunk;
>>> 
>>> complete:
>>> +	rqstp->rq_rtime = ctxt->rtime;
>>> 	svc_rdma_put_context(ctxt, 0);
>>> 	dprintk("svcrdma: recvfrom: xprt=%p, rqstp=%p, rq_arg.len=%u\n",
>>> 		rdma_xprt, rqstp, rqstp->rq_arg.len);
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> index ca211dc..08fa7de 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> @@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
>>> 		goto flushed;
>>> 
>>> 	/* All wc fields are now known to be valid */
>>> +	ctxt->rtime = ktime_get_real();
>>> 	ctxt->byte_len = wc->byte_len;
>>> 	spin_lock(&xprt->sc_rq_dto_lock);
>>> 	list_add_tail(&ctxt->list, &xprt->sc_rq_dto_q);
> 
> --
> Chuck Lever
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever


