> On May 29, 2020, at 9:02 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> While testing other things, I noticed that several iozone tests showed
> a significant regression in large direct WRITE performance with little
> to no drop in small WRITE IOPS.
>
> One example (NFS/RDMA on FDR InfiniBand):
>
> Machine = Linux manet.1015granger.net 5.7.0-rc7-00033-g8de6ca0614d4 #1071 SMP
> CPU utilization Resolution = 0.000 seconds.
> CPU utilization Excel chart enabled
> File size set to 1048576 kB
> Record Size 256 kB
> O_DIRECT feature enabled
> Command line used: /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
> Output is in kBytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 kBytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
> Throughput test with 12 processes
> Each process writes a 1048576 kByte file in 256 kByte records
>
> Children see throughput for 12 initial writers = 2430898.66 kB/sec
> Parent sees throughput for 12 initial writers = 2425731.85 kB/sec
> Min throughput per process = 202025.03 kB/sec
> Max throughput per process = 202899.33 kB/sec
> Avg throughput per process = 202574.89 kB/sec
> Min xfer = 1044224.00 kB
> CPU Utilization: Wall time 5.179 CPU time 2.020 CPU utilization 39.00 %
>
> Children see throughput for 12 rewriters = 2431774.06 kB/sec
> Parent sees throughput for 12 rewriters = 2431230.83 kB/sec
> Min throughput per process = 202230.42 kB/sec
> Max throughput per process = 202926.08 kB/sec
> Avg throughput per process = 202647.84 kB/sec
> Min xfer = 1045248.00 kB
> CPU utilization: Wall time 5.169 CPU time 2.015 CPU utilization 38.99 %
>
> These numbers are half what they usually are.
>
> I bisected between v5.6 and v5.7-rc7, and it terminated on 1f28476dcb98
> ("NFS: Fix O_DIRECT commit verifier handling").
>
> This commit doesn't revert cleanly -- the kernel won't build after it is
> reverted, so I can't easily do the obvious test to confirm the bisect
> result.
>
> I intend to look into the exact pathology, but wanted to get this regression
> reported first, in case someone has a thought about what is slowing things
> down.

The observed behavior is that the client sends every WRITE twice:
once as an UNSTABLE WRITE plus a COMMIT, and once as a FILE_SYNC
WRITE.

This is because the nfs_write_match_verf() check in
nfs_direct_commit_complete() fails for every on-the-wire WRITE.

Buffered writes use nfs_write_completion(), which sets req->wb_verf
correctly. Direct writes use nfs_direct_write_completion(), which
does not set req->wb_verf at all. This leaves req->wb_verf set to
all zeroes for every direct WRITE, and thus
nfs_direct_commit_complete() always requests a resend.

I confirmed all this by adding temporary tracepoints in the write
completion paths.

Seems like the fix is to duplicate the guts of nfs_write_completion()
in nfs_direct_write_completion() (or refactor the guts into helpers
that both functions invoke).

--
Chuck Lever
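
For illustration only, here is a small user-space C model of the failure described above. It is not kernel code and not a proposed patch: every struct and function name in it is hypothetical. Only the 8-byte write verifier size and the compare-then-resend logic are taken from the description. It shows why a request whose stored verifier is never filled in always fails the match at COMMIT time.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define VERF_SIZE 8 /* NFS write verifiers are 8 opaque bytes */

struct write_verf {
	unsigned char data[VERF_SIZE];
};

/* Hypothetical stand-in for a write request; not struct nfs_page. */
struct write_req {
	struct write_verf wb_verf; /* verifier remembered from the WRITE reply */
	bool needs_resend;         /* set when COMMIT finds a mismatch */
};

/* Model of the verifier comparison done at COMMIT completion. */
static bool verf_match(const struct write_verf *a, const struct write_verf *b)
{
	return memcmp(a->data, b->data, VERF_SIZE) == 0;
}

/* Direct-path WRITE completion as described: the verifier is not saved. */
static void direct_write_done_unset(struct write_req *req,
				    const struct write_verf *reply_verf)
{
	(void)req;
	(void)reply_verf; /* wb_verf stays all zeroes */
}

/* Buffered-path-style WRITE completion: the reply verifier is saved. */
static void write_done_save_verf(struct write_req *req,
				 const struct write_verf *reply_verf)
{
	req->wb_verf = *reply_verf;
}

/* COMMIT completion: a mismatch means the data must be sent again. */
static void commit_done(struct write_req *req,
			const struct write_verf *commit_verf)
{
	assert(req != NULL && commit_verf != NULL);
	req->needs_resend = !verf_match(&req->wb_verf, commit_verf);
}
```

In this model, a zero-initialized request passed through direct_write_done_unset() and then commit_done() ends up with needs_resend set whenever the server's verifier is non-zero, while the same sequence using write_done_save_verf() does not.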