> On Mar 22, 2018, at 5:01 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> On Thu, Mar 22, 2018 at 04:32:36PM -0400, Chuck Lever wrote:
>> 
>>> On Mar 14, 2018, at 9:11 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>> 
>>>> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>> 
>>>> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>>>>> Introduce a mechanism to report the server-side execution latency of
>>>>> each RPC. The goal is to enable user space to filter the trace
>>>>> record for latency outliers, build histograms, etc.
>>>> 
>>>> Neato.
>>>> 
>>>> To be useful I'd think you'd want more information about each RPC.
>>>> (E.g. you'd like to know that the latency outliers all did reads). I
>>>> guess you could use the address and xid to cross-reference with
>>>> information collected from somewhere else?
>>> 
>>> Yes. You can enable other trace points (like the nfsd_compound ones)
>>> to see what class each operation is in.
>>> 
>>> And yes, I would like to have all the relevant information for each
>>> RPC in a single trace record; I just haven't figured out a way to
>>> extract it as nicely as I did it on the client (patch forthcoming).
>>> On the client side, there is a table set up for each RPC program that
>>> contains an RPC procedure name to procedure number mapping. I was not
>>> able to find a similar convenience on the server side.
>>> 
>>>> What's our commitment to backwards compatibility? Trace points seem to
>>>> be the wild west compared to the rest of the kernel interface, but if
>>>> we want to encourage tooling on top of this then I guess we'll need to
>>>> be strict.
>>> 
>>> That has been discussed elsewhere (LWN.net and more recently on
>>> linux-rdma). The only compatibility issues are with trace points that
>>> have user tools and infrastructure that depends on them, such as the
>>> scheduler trace points used for latencyTOP. The NFS and sunrpc trace
>>> points do not have this constraint, as they are processed currently
>>> only by generic tools like trace-cmd. So we are free to innovate for
>>> the time being.
>>> 
>>>> Looking at the tcp case, I think it's measuring from the time
>>>> tcp_recvfrom receives the last fragment making up an rpc request till
>>>> the last sendpage() of the reply returns. Did you consider other spots?
>>>> (E.g. why after the send instead of before?)
>>> 
>>> Yes, I've considered other spots. I don't consider the spots I'm
>>> proposing here to be written in stone. I welcome help for placing the
>>> socket-based timestamp capture points.
>>> 
>>> Some sendto implementations are more complex than others. For instance,
>>> RPC/RDMA can post RDMA Writes containing data content first, then in the
>>> final step post the RDMA Send carrying the RPC Reply header. The RDMA
>>> Write step can be considered server-side processing, and thus is part
>>> of the latency. Or, if we ever decide to move the RDMA Write step into
>>> the XDR layer, it will definitely be counted as processing latency.
>>> 
>>> One thing I would like to keep in the latency measurement is how long
>>> this rqstp has waited to acquire the send mutex. But otherwise, I'm
>>> open to other ideas about how to measure this latency.
>> 
>> Hi Bruce-
>> 
>> How about measuring the same way for all transports:
>> 
>> - Capture a timestamp when xpo_recvfrom returns a positive value
>> 
>> - Fire the svc_stats_latency event just before invoking xpo_sendto
>> 
>> Would you be more comfortable with that arrangement?
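
To make that concrete, the shape I have in mind is roughly the
following. This is a sketch only, not actual patch text: the rq_stime
field and the exact call sites in svc_recv()/svc_send() are
illustrative.

	/* In svc_recv(), once xpo_recvfrom() has returned a complete request: */
	len = xprt->xpt_ops->xpo_recvfrom(rqstp);
	if (len > 0)
		rqstp->rq_stime = ktime_get();	/* start of server-side processing */

	/* In svc_send(), just before handing the reply to the transport: */
	trace_svc_stats_latency(rqstp);		/* reports ktime_get() - rq_stime */
	len = xprt->xpt_ops->xpo_sendto(rqstp);
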
> 
> I honestly don't know.  If I understand right: if we put that event just
> before xpo_sendto, then in the case you describe above (RDMA Write step
> moved between xdr layer and sendto), the latency number would change for
> no really good reason.  So that's a case for your original approach?
> 
> I don't know what's likely to contribute to sendto latency in the socket
> case.

Me neither, so maybe it makes sense to exclude sendto for now until we
have a better sense of what needs to be measured.

> Really, I've never done latency tracing, and from what I've seen you may
> have done more than any of us, so I trust your judgement here....

Only by an hour or two.

What we leave out with this proposal is the cost of xpo_sendto, and the
cost of the svc_xprt_enqueue and subsequent context switch before the
server calls xpo_recvfrom. If we want to get down-and-dirty with
particular transport mechanisms, it might be better to insert special
trace points for those.

The enqueue wake-up latency is actually pretty significant in some
cases. We could add a latency report to trace_svc_xprt_dequeue, for
example.

I can respin that patch series with these changes.

--
Chuck Lever
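
P.S. For the dequeue wake-up latency, the trace point could report the
delta itself. Here is a sketch of one way to do it; the rq_qtime field
and the event arguments are assumptions, not actual patch text:

TRACE_EVENT(svc_xprt_dequeue,
	TP_PROTO(struct svc_rqst *rqst),
	TP_ARGS(rqst),

	TP_STRUCT__entry(
		__field(const void *, xprt)
		__field(unsigned long, flags)
		__field(s64, wakeup_us)
	),

	TP_fast_assign(
		__entry->xprt = rqst->rq_xprt;
		__entry->flags = rqst->rq_xprt->xpt_flags;
		/* rq_qtime: assumed timestamp captured when the xprt is enqueued */
		__entry->wakeup_us = ktime_to_us(ktime_sub(ktime_get(),
							   rqst->rq_qtime));
	),

	TP_printk("xprt=%p flags=%lx wakeup-us=%lld",
		__entry->xprt, __entry->flags, __entry->wakeup_us)
);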