> On Mar 22, 2018, at 5:01 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> On Thu, Mar 22, 2018 at 04:32:36PM -0400, Chuck Lever wrote:
>> 
>>> On Mar 14, 2018, at 9:11 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>> 
>>>> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>> 
>>>> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>>>>> Introduce a mechanism to report the server-side execution latency of
>>>>> each RPC. The goal is to enable user space to filter the trace
>>>>> record for latency outliers, build histograms, etc.
>>>> 
>>>> Neato.
>>>> 
>>>> To be useful I'd think you'd want more information about each RPC.
>>>> (E.g. you'd like to know that the latency outliers all did reads). I
>>>> guess you could use the address and xid to cross-reference with
>>>> information collected from somewhere else?
>>> 
>>> Yes. You can enable other trace points (like the nfsd_compound ones)
>>> to see what class each operation is in.
>>> 
>>> And yes, I would like to have all the relevant information for each
>>> RPC in a single trace record; I just haven't figured out a way to
>>> extract it as nicely as I did it on the client (patch forthcoming).
>>> On the client side, there is a table set up for each RPC program that
>>> contains an RPC procedure name to procedure number mapping. I was not
>>> able to find a similar convenience on the server side.
>>> 
>>>> What's our commitment to backwards compatibility? Trace points seem to
>>>> be the wild west compared to the rest of the kernel interface, but if
>>>> we want to encourage tooling on top of this then I guess we'll need to
>>>> be strict.
>>> 
>>> That has been discussed elsewhere (LWN.net and more recently on
>>> linux-rdma). The only compatibility issues are with trace points that
>>> have user tools and infrastructure that depends on them, such as the
>>> scheduler trace points used for latencyTOP. The NFS and sunrpc trace
>>> points do not have this constraint, as they are processed currently
>>> only by generic tools like trace-cmd. So we are free to innovate for
>>> the time being.
>>> 
>>>> Looking at the tcp case, I think it's measuring from the time
>>>> tcp_recvfrom receives the last fragment making up an rpc request till
>>>> the last sendpage() of the reply returns. Did you consider other spots?
>>>> (E.g. why after the send instead of before?)
>>> 
>>> Yes, I've considered other spots. I don't consider the spots I'm
>>> proposing here to be written in stone. I welcome help for placing the
>>> socket-based timestamp capture points.
>>> 
>>> Some sendto implementations are more complex than others. For instance,
>>> RPC/RDMA can post RDMA Writes containing data content first, then in the
>>> final step post the RDMA Send carrying the RPC Reply header. The RDMA
>>> Write step can be considered server-side processing, and thus is part
>>> of the latency. Or, if we ever decide to move the RDMA Write step into
>>> the XDR layer, it will definitely be counted as processing latency.
>>> 
>>> One thing I would like to keep in the latency measurement is how long
>>> this rqstp has waited to acquire the send mutex. But otherwise, I'm
>>> open to other ideas about how to measure this latency.
>> 
>> Hi Bruce-
>> 
>> How about measuring the same way for all transports:
>> 
>> - Capture a timestamp when xpo_recvfrom returns a positive value
>> 
>> - Fire the svc_stats_latency event just before invoking xpo_sendto
>> 
>> Would you be more comfortable with that arrangement?
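
To make that concrete, the shape I have in mind is roughly the
following. This is a sketch only, not actual patch text: the rq_stime
field and the exact call sites in svc_recv()/svc_send() are
illustrative.

	/* In svc_recv(), once xpo_recvfrom() has returned a complete request: */
	len = xprt->xpt_ops->xpo_recvfrom(rqstp);
	if (len > 0)
		rqstp->rq_stime = ktime_get();	/* start of server-side processing */

	/* In svc_send(), just before handing the reply to the transport: */
	trace_svc_stats_latency(rqstp);		/* reports ktime_get() - rq_stime */
	len = xprt->xpt_ops->xpo_sendto(rqstp);
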
> 
> I honestly don't know.  If I understand right: if we put that event just
> before xpo_sendto, then in the case you describe above (RDMA Write step
> moved between xdr layer and sendto), the latency number would change for
> no really good reason.  So that's a case for your original approach?
> 
> I don't know what's likely to contribute to sendto latency in the socket
> case.

Me neither, so maybe it makes sense to exclude sendto for now until we
have a better sense of what needs to be measured.

> Really, I've never done latency tracing, and from what I've seen you may
> have done more than any of us, so I trust your judgement here....

Only by an hour or two.

What we leave out with this proposal is the cost of xpo_sendto, and the
cost of the svc_xprt_enqueue and subsequent context switch before the
server calls xpo_recvfrom. If we want to get down-and-dirty with
particular transport mechanisms, it might be better to insert special
trace points for those.

The enqueue wake-up latency is actually pretty significant in some
cases. We could add a latency report to trace_svc_xprt_dequeue, for
example.

I can respin that patch series with these changes.

--
Chuck Lever
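
P.S. For the dequeue wake-up latency, the trace point could report the
delta itself. Here is a sketch of one way to do it; the rq_qtime field
and the event arguments are assumptions, not actual patch text:

TRACE_EVENT(svc_xprt_dequeue,
	TP_PROTO(struct svc_rqst *rqst),
	TP_ARGS(rqst),

	TP_STRUCT__entry(
		__field(const void *, xprt)
		__field(unsigned long, flags)
		__field(s64, wakeup_us)
	),

	TP_fast_assign(
		__entry->xprt = rqst->rq_xprt;
		__entry->flags = rqst->rq_xprt->xpt_flags;
		/* rq_qtime: assumed timestamp captured when the xprt is enqueued */
		__entry->wakeup_us = ktime_to_us(ktime_sub(ktime_get(),
							   rqst->rq_qtime));
	),

	TP_printk("xprt=%p flags=%lx wakeup-us=%lld",
		__entry->xprt, __entry->flags, __entry->wakeup_us)
);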