> On Mar 14, 2018, at 9:11 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>
>> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>
>> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>>> Introduce a mechanism to report the server-side execution latency of
>>> each RPC. The goal is to enable user space to filter the trace
>>> record for latency outliers, build histograms, etc.
>>
>> Neato.
>>
>> To be useful I'd think you'd want more information about each RPC.
>> (E.g. you'd like to know that the latency outliers all did reads.) I
>> guess you could use the address and xid to cross-reference with
>> information collected from somewhere else?
>
> Yes. You can enable other trace points (like the nfsd_compound ones)
> to see what class each operation is in.
>
> And yes, I would like to have all the relevant information for each
> RPC in a single trace record; I just haven't figured out a way to
> extract it as nicely as I did on the client (patch forthcoming).
> On the client side, there is a table set up for each RPC program
> that maps RPC procedure names to procedure numbers. I was not able
> to find a similar convenience on the server side.
>
>
>> What's our commitment to backwards compatibility? Trace points seem to
>> be the wild west compared to the rest of the kernel interface, but if
>> we want to encourage tooling on top of this then I guess we'll need to
>> be strict.
>
> That has been discussed elsewhere (on LWN.net, and more recently on
> linux-rdma). The only compatibility issues are with trace points that
> have user tools and infrastructure depending on them, such as the
> scheduler trace points used by latencyTOP. The NFS and sunrpc trace
> points do not have this constraint: they are currently processed only
> by generic tools like trace-cmd. So we are free to innovate for the
> time being.
>
>
>> Looking at the tcp case, I think it's measuring from the time
>> tcp_recvfrom receives the last fragment making up an rpc request till
>> the last sendpage() of the reply returns. Did you consider other spots?
>> (E.g. why after the send instead of before?)
>
> Yes, I've considered other spots. I don't consider the spots I'm
> proposing here to be written in stone, and I welcome help with placing
> the socket-based timestamp capture points.
>
> Some sendto implementations are more complex than others. For instance,
> RPC/RDMA can post RDMA Writes containing data content first, then in a
> final step post the RDMA Send carrying the RPC Reply header. The RDMA
> Write step can be considered server-side processing, and thus part of
> the latency. And if we ever decide to move the RDMA Write step into
> the XDR layer, it will definitely count as processing latency.
>
> One thing I would like to keep in the latency measurement is how long
> this rqstp has waited to acquire the send mutex. But otherwise, I'm
> open to other ideas about how to measure this latency.

Hi Bruce-

How about measuring the same way for all transports:

- Capture a timestamp when xpo_recvfrom returns a positive value
- Fire the svc_stats_latency event just before invoking xpo_sendto

Would you be more comfortable with that arrangement?
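In code, that would look something like this (untested sketch; the
surrounding lines are paraphrased from the svc_handle_xprt() and
svc_send() hunks quoted below, so the exact context may differ):

	/* In svc_handle_xprt(): stamp the request as soon as a
	 * complete RPC message is in hand, for every transport. */
	if (rqstp->rq_deferred)
		len = svc_deferred_recv(rqstp);
	else
		len = xprt->xpt_ops->xpo_recvfrom(rqstp);
	if (len > 0)
		rqstp->rq_rtime = ktime_get_real();

	/* In svc_send(): fire the event just before handing the
	 * reply to the transport, so the wait for xpt_mutex stays
	 * inside the measured interval but the send itself does not. */
	mutex_lock(&xprt->xpt_mutex);
	if (test_bit(XPT_DEAD, &xprt->xpt_flags) ||
	    test_bit(XPT_CLOSE, &xprt->xpt_flags))
		len = -ENOTCONN;
	else {
		trace_svc_stats_latency(rqstp, ktime_get_real());
		len = xprt->xpt_ops->xpo_sendto(rqstp);
	}
	mutex_unlock(&xprt->xpt_mutex);

That would give up the skb->tstamp in the UDP case, but all three
transports would then measure the same interval.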
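Whichever spots we settle on, user space should then be able to filter
for outliers with generic tooling; for example, something like (the
execute field is reported in microseconds, so this would keep only
RPCs that took longer than 10ms):

	# trace-cmd record -e sunrpc:svc_stats_latency -f 'execute > 10000'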
>> --b.
>>
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> ---
>>>  include/linux/sunrpc/svc.h               |    1 +
>>>  include/linux/sunrpc/svc_rdma.h          |    1 +
>>>  include/trace/events/sunrpc.h            |   20 ++++++++++++++++++++
>>>  net/sunrpc/svc_xprt.c                    |    2 +-
>>>  net/sunrpc/svcsock.c                     |    2 ++
>>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |    1 +
>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    1 +
>>>  7 files changed, 27 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>> index 786ae22..1939709 100644
>>> --- a/include/linux/sunrpc/svc.h
>>> +++ b/include/linux/sunrpc/svc.h
>>> @@ -245,6 +245,7 @@ struct svc_rqst {
>>>  	void *			rq_xprt_ctxt;	/* transport specific context ptr */
>>>  	struct svc_deferred_req*rq_deferred;	/* deferred request we are replaying */
>>>
>>> +	ktime_t			rq_rtime;	/* receive time */
>>>  	size_t			rq_xprt_hlen;	/* xprt header len */
>>>  	struct xdr_buf		rq_arg;
>>>  	struct xdr_buf		rq_res;
>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>> index 4b731b0..8045981 100644
>>> --- a/include/linux/sunrpc/svc_rdma.h
>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>> @@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
>>>  	struct list_head list;
>>>  	struct xdr_buf arg;
>>>  	struct ib_cqe cqe;
>>> +	ktime_t rtime;
>>>  	u32 byte_len;
>>>  	struct svcxprt_rdma *xprt;
>>>  	enum dma_data_direction direction;
>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>>> index 5a8157c..50de512 100644
>>> --- a/include/trace/events/sunrpc.h
>>> +++ b/include/trace/events/sunrpc.h
>>> @@ -705,6 +705,26 @@
>>>  		__entry->len, show_svc_xprt_flags(__entry->flags))
>>>  );
>>>
>>> +TRACE_EVENT(svc_stats_latency,
>>> +	TP_PROTO(const struct svc_rqst *rqst, ktime_t end),
>>> +
>>> +	TP_ARGS(rqst, end),
>>> +
>>> +	TP_STRUCT__entry(
>>> +		__field(u32, xid)
>>> +		__field(unsigned long, execute)
>>> +		__string(addr, rqst->rq_xprt->xpt_remotebuf)
>>> +	),
>>> +
>>> +	TP_fast_assign(
>>> +		__entry->xid = be32_to_cpu(rqst->rq_xid);
>>> +		__entry->execute = ktime_to_us(ktime_sub(end, rqst->rq_rtime));
>>> +		__assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>>> +	),
>>> +
>>> +	TP_printk("addr=%s xid=0x%08x execute=%lu",
>>> +		__get_str(addr), __entry->xid, __entry->execute)
>>> +);
>>>
>>>  DECLARE_EVENT_CLASS(svc_deferred_event,
>>>  	TP_PROTO(struct svc_deferred_req *dr),
>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>> index f745754..2556e6e 100644
>>> --- a/net/sunrpc/svc_xprt.c
>>> +++ b/net/sunrpc/svc_xprt.c
>>> @@ -782,7 +782,6 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
>>>  			len = svc_deferred_recv(rqstp);
>>>  		else
>>>  			len = xprt->xpt_ops->xpo_recvfrom(rqstp);
>>> -		dprintk("svc: got len=%d\n", len);
>>>  		rqstp->rq_reserved = serv->sv_max_mesg;
>>>  		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>>>  	}
>>> @@ -893,6 +892,7 @@ int svc_send(struct svc_rqst *rqstp)
>>>  		len = -ENOTCONN;
>>>  	else
>>>  		len = xprt->xpt_ops->xpo_sendto(rqstp);
>>> +	trace_svc_stats_latency(rqstp, ktime_get_real());
>>>  	mutex_unlock(&xprt->xpt_mutex);
>>>  	rpc_wake_up(&xprt->xpt_bc_pending);
>>>  	trace_svc_send(rqstp, len);
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 4ca1d92..f616962 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -574,6 +574,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>>>  		/* Don't enable netstamp, sunrpc doesn't
>>>  		   need that much accuracy */
>>>  	}
>>> +	rqstp->rq_rtime = skb->tstamp;
>>>  	svsk->sk_sk->sk_stamp = skb->tstamp;
>>>  	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */
>>>
>>> @@ -1143,6 +1144,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>>  	if (len < 0)
>>>  		goto error;
>>>
>>> +	rqstp->rq_rtime = ktime_get_real();
>>>  	svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
>>>  	if (serv->sv_stats)
>>>  		serv->sv_stats->nettcpcnt++;
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> index 19e9c6b..446b9d6 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> @@ -558,6 +558,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
>>>  		goto out_readchunk;
>>>
>>>  complete:
>>> +	rqstp->rq_rtime = ctxt->rtime;
>>>  	svc_rdma_put_context(ctxt, 0);
>>>  	dprintk("svcrdma: recvfrom: xprt=%p, rqstp=%p, rq_arg.len=%u\n",
>>>  		rdma_xprt, rqstp, rqstp->rq_arg.len);
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> index ca211dc..08fa7de 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> @@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
>>>  		goto flushed;
>>>
>>>  	/* All wc fields are now known to be valid */
>>> +	ctxt->rtime = ktime_get_real();
>>>  	ctxt->byte_len = wc->byte_len;
>>>  	spin_lock(&xprt->sc_rq_dto_lock);
>>>  	list_add_tail(&ctxt->list, &xprt->sc_rq_dto_q);
>
> --
> Chuck Lever

--
Chuck Lever