> On Mar 16, 2023, at 5:21 PM, David Howells <dhowells@xxxxxxxxxx> wrote:
>
> Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>> Therefore, this kind of change needs to be accompanied by both
>> benchmark results and some field testing to convince me it won't
>> cause harm.
>
> Btw, what do you use to benchmark NFS performance?

It depends on what I'm trying to observe. I have only a small handful
of systems in my lab, which is why I was not able to immediately
detect the effects of the zero-copy change here. Daire has a large
client cohort on a fast network, so he is able to see the impact of
that kind of change quite readily.

A perhaps more interesting question is what kind of tooling I would
use to measure the performance of the proposed change. The bottom
line is whether or not applications on clients can see a change. NFS
client implementations can hide server and network latency
improvements from applications, and RPC-on-TCP itself adds enough
latency to blunt the effect of server-side performance optimizations.

For that I might use a multi-threaded fio workload with fixed record
sizes (2KB, 8KB, 128KB, 1MB) and then look at the throughput numbers
and latency distribution for each size. In a single-thread qd=1 test,
iozone can show changes in READ latency pretty clearly, though most
folks believe qd=1 tests are junk. I generally run such tests on
100GbE with a tmpfs or NVMe export to take filesystem latencies out
of the equation, although that matters more for WRITE latency if you
can keep your READ workload entirely in server memory.

To measure server-side behavior without the effects of the network or
client, NFSD has a built-in tracepoint, nfsd:svc_stats_latency, that
records the latency in microseconds of each RPC. Run the above
workloads and record this tracepoint (perhaps with a capture filter
so that only READ operations are recorded). Then post-process the raw
latencies to get an average and deviation, or look at the latency
distribution to see whether the shape of the outlier curve has
changed. I use awk for this.

[ Sidebar: you can use this tracepoint to track latency outliers
  too, but that's another topic. ]

Second, I might try a flame graph study to measure changes in
instruction path length, and also capture an average
cycles-per-byte-read value. Looking at CPU cache misses can often be
a rathole, but flame graphs can surface changes there too.

And lastly, you might want to visit lock_stats to see if there is any
significant change in lock contention. An unexpected increase in lock
contention can easily rob you of gains made in other areas.

My guess is that for the RQ_SPLICE_OK case, the difference would
amount to eliminating the kernel_sendpage calls, which are indirect
calls but not terribly expensive. They become a significant cost only
for large I/O, and even then might not amount to much relative to the
other costs in the READ path.

So the real purpose here would have to be refactoring to use bvecs
instead of the bespoke xdr_buf structure, and I would like to see
support for bvecs in all of our transports (looking at you, RDMA) to
make this truly worthwhile. I had started on this a while back, but
the lack of a bvec-based RDMA API made it less interesting to me. It
isn't clear to me yet whether bvecs or folios should replace
xdr_buf's head/pages/tail, but I'm a paid-in-full member of the
uneducated rabble.
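
Going back to the measurement question for a moment, here is a rough
sketch of the harness I have in mind. Treat all of it as illustrative:
the mount point, sizes, run lengths, and the exact tracepoint and
field names are from memory, so check them against your setup (and
against "trace-cmd list") before trusting the numbers.

  # Client side: multi-threaded sequential reads at fixed record
  # sizes against the NFS mount (path and sizes are placeholders)
  for bs in 2k 8k 128k 1m; do
      fio --name=read-$bs --rw=read --bs=$bs --numjobs=8 \
          --size=1g --directory=/mnt/nfs --ioengine=psync \
          --group_reporting --output=read-$bs.log
  done

  # Server side: capture the per-RPC latency tracepoint for the
  # duration of the run, then pull out the READ latencies (NFSv3;
  # v4 operations all show up inside COMPOUND) and compute a mean
  # and standard deviation
  trace-cmd record -e svc_stats_latency sleep 300
  trace-cmd report > latency.txt

  awk -F'execute-us=' '/READ/ && NF > 1 {
          v = $2 + 0; n++; sum += v; sumsq += v * v
      }
      END {
          if (n) {
              m = sum / n
              printf "n=%d mean=%.1f us sd=%.1f us\n",
                     n, m, sqrt(sumsq / n - m * m)
          }
      }' latency.txt

  # Path length / flame graph: sample the whole server during the
  # run and feed perf.data to the usual FlameGraph scripts;
  # perf stat gives a rough cycles/instructions/cache-miss picture
  perf record -F 99 -a -g -- sleep 60
  perf stat -e cycles,instructions,cache-misses -a -- sleep 60

  # Lock contention (requires CONFIG_LOCK_STAT=y): save the report
  # before and after the change and compare
  echo 1 > /proc/sys/kernel/lock_stat
  #   ... run the workload ...
  cat /proc/lock_stat > lock_stat.out
  echo 0 > /proc/sys/kernel/lock_stat

Compare the fio latency percentiles and the awk summary before and
after the patch; if the tracepoint numbers move but the client-visible
numbers don't, the network or the client is hiding the change.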
This might sound like a lot of pushback, but actually I am open to
discussing clean-ups in this area, including the one you proposed.
Just getting a little more careful about this kind of change as time
goes on. And it sounds like you were already aware of the most recent
previous attempt at this kind of improvement.

--
Chuck Lever