> On Mar 16, 2023, at 5:21 PM, David Howells <dhowells@xxxxxxxxxx> wrote:
>
> Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>> Therefore, this kind of change needs to be accompanied by both
>> benchmark results and some field testing to convince me it won't
>> cause harm.
>
> Btw, what do you use to benchmark NFS performance?

It depends on what I'm trying to observe. I have only a small handful
of systems in my lab, which is why I was not able to immediately
detect the effects of the zero-copy change here. Daire has a large
client cohort on a fast network, so he is able to see the impact of
that kind of change quite readily.

A perhaps more interesting question is what kind of tooling I would
use to measure the performance of the proposed change. The bottom
line is whether or not applications on clients can see a change. NFS
client implementations can hide server and network latency
improvements from applications, and RPC-on-TCP itself adds enough
latency to blunt the effect of server-side performance optimizations.

For that I might use a multi-threaded fio workload with fixed record
sizes (2KB, 8KB, 128KB, 1MB) and then look at the throughput numbers
and latency distribution for each size. In a single-thread qd=1 test,
iozone can show changes in READ latency pretty clearly, though most
folks believe qd=1 tests are junk. I generally run such tests on
100GbE with a tmpfs or NVMe export to take filesystem latencies out
of the equation, although that matters more for WRITE latency if you
can keep your READ workload entirely in server memory.

To measure server-side behavior without the effects of the network or
client, NFSD has a built-in tracepoint, nfsd:svc_stats_latency, that
records the latency in microseconds of each RPC. Run the above
workloads and record this tracepoint (perhaps with a capture filter
so that only READ operations are recorded). Then post-process the raw
latencies to get an average and deviation, or look at the latency
distribution to see whether the shape of the outlier curve has
changed. I use awk for this.

[ Sidebar: you can use this tracepoint to track latency outliers
  too, but that's another topic. ]

Second, I might try a flame graph study to measure changes in
instruction path length, and also capture an average
cycles-per-byte-read value. Looking at CPU cache misses can often be
a rathole, but flame graphs can surface changes there too.

And lastly, you might want to visit lock_stats to see if there is any
significant change in lock contention. An unexpected increase in lock
contention can easily rob you of gains made in other areas.

My guess is that for the RQ_SPLICE_OK case, the difference would
amount to eliminating the kernel_sendpage calls, which are indirect
calls but not terribly expensive. They become a significant cost only
for large I/O, and even then might not amount to much relative to the
other costs in the READ path.

So the real purpose here would have to be refactoring to use bvecs
instead of the bespoke xdr_buf structure, and I would like to see
support for bvecs in all of our transports (looking at you, RDMA) to
make this truly worthwhile. I had started on this a while back, but
the lack of a bvec-based RDMA API made it less interesting to me. It
isn't clear to me yet whether bvecs or folios should replace
xdr_buf's head/pages/tail, but I'm a paid-in-full member of the
uneducated rabble.
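
Going back to the measurement question for a moment, here is a rough
sketch of the harness I have in mind. Treat all of it as illustrative:
the mount point, sizes, run lengths, and the exact tracepoint and
field names are from memory, so check them against your setup (and
against "trace-cmd list") before trusting the numbers.

  # Client side: multi-threaded sequential reads at fixed record
  # sizes against the NFS mount (path and sizes are placeholders)
  for bs in 2k 8k 128k 1m; do
      fio --name=read-$bs --rw=read --bs=$bs --numjobs=8 \
          --size=1g --directory=/mnt/nfs --ioengine=psync \
          --group_reporting --output=read-$bs.log
  done

  # Server side: capture the per-RPC latency tracepoint for the
  # duration of the run, then pull out the READ latencies (NFSv3;
  # v4 operations all show up inside COMPOUND) and compute a mean
  # and standard deviation
  trace-cmd record -e svc_stats_latency sleep 300
  trace-cmd report > latency.txt

  awk -F'execute-us=' '/READ/ && NF > 1 {
          v = $2 + 0; n++; sum += v; sumsq += v * v
      }
      END {
          if (n) {
              m = sum / n
              printf "n=%d mean=%.1f us sd=%.1f us\n",
                     n, m, sqrt(sumsq / n - m * m)
          }
      }' latency.txt

  # Path length / flame graph: sample the whole server during the
  # run and feed perf.data to the usual FlameGraph scripts;
  # perf stat gives a rough cycles/instructions/cache-miss picture
  perf record -F 99 -a -g -- sleep 60
  perf stat -e cycles,instructions,cache-misses -a -- sleep 60

  # Lock contention (requires CONFIG_LOCK_STAT=y): save the report
  # before and after the change and compare
  echo 1 > /proc/sys/kernel/lock_stat
  #   ... run the workload ...
  cat /proc/lock_stat > lock_stat.out
  echo 0 > /proc/sys/kernel/lock_stat

Compare the fio latency percentiles and the awk summary before and
after the patch; if the tracepoint numbers move but the client-visible
numbers don't, the network or the client is hiding the change.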
This might sound like a lot of pushback, but actually I am open to
discussing clean-ups in this area, including the one you proposed.
Just getting a little more careful about this kind of change as time
goes on. And it sounds like you were already aware of the most recent
previous attempt at this kind of improvement.

--
Chuck Lever