Re: nfs client strange behavior with cpuwait and memory writeback

On Sun, 2022-09-11 at 20:58 +0200, Isak wrote:
> Hi everybody!!!
> 
> I am very happy to be writing my first email to one of the Linux mailing lists.
> 
> I have read the FAQ and I know this mailing list is not a user help
> desk, but I am seeing strange behavior with memory writeback and NFS.
> Maybe someone can help me. I am sorry if this is not the right
> "forum".
> 
> I did three simple tests writing to the same NFS filesystem, and the
> CPU and memory behavior has me completely puzzled.
> 
> The Environment:
> 
> - Linux Red Hat 8.6, 2 vCPU (VMware VM) and 8 GB RAM (but same behavior
> with Red Hat 7.9)
> 
> - One NFS filesystem, mounted both with and without the sync option:
> 
> 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
> (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
> 
> 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
> 
> - The link between the NFS client and the NFS server is 10 Gb (fiber),
> and iperf3 shows the link runs at full speed. No problems here. I know
> there are NFS options like nconnect to improve performance, but I am
> interested in the Linux kernel internals.
> 
> The test:
> 
> 1.- dd in /mnt/test_fs_without_sync
> 
> dd if=/dev/zero of=test.out bs=1M count=5000
> 5000+0 records in
> 5000+0 records out
> 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s
> 
> * High cpuwait
> * High nfs latency
> * Writeback in use
> 
> Evidence:
> https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=
> 
> https://i.stack.imgur.com/pTong.png
> 
> 
> 
> 2.- dd in /mnt/test_fs_with_sync
> 
> dd if=/dev/zero of=test.out bs=1M count=5000
> 5000+0 records in
> 5000+0 records out
> 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s
> 
> * High cpuwait
> * Low nfs latency
> * No writeback
> 
> Evidence:
> https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=
> 
> https://i.stack.imgur.com/Pf1xS.png
> 
> 
> 
> 3.- dd in /mnt/test_fs_with_sync and oflag=direct
> 
> dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
> 5000+0 records in
> 5000+0 records out
> 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s
> 
> * Low cpuwait
> * Low nfs latency
> * No writeback
> 
> Evidence:
> https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=
> 
> https://i.stack.imgur.com/Qs6y5.png
> 
> 
> 
> 
> The questions:
> 
> I know writeback is an old issue in Linux and it seems to be the problem
> here. I played with vm.dirty_background_bytes/vm.dirty_background_ratio
> and vm.dirty_bytes/vm.dirty_ratio (I know only one of each pair is
> valid), but whatever value I put in these tunables I always see iowait
> (except with dd and oflag=direct).
> 
> - In test number 2: how is it possible that there is low NFS latency but
> high cpuwait?
> 
> - In test number 2: how is it possible that it follows almost the same
> code path as test number 1? Test number 2 uses an NFS filesystem mounted
> with the sync option but seems to use the pagecache codepath (see flame graph).
> 

"sync" just means that the write codepaths do an implicit fsync of the
written range after every write. The data still goes through the
pagecache in that case. It just does a (synchronous) flush of the data
to the server and a commit after every 1M (in your case).
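
If it helps to see that from userspace, a rough analogue (just a sketch;
O_SYNC on the non-sync mount is not byte-for-byte identical to the sync
mount option, but both make each 1M write wait for the server) is:

  # each 1M write must reach the server before dd issues the next one
  dd if=/dev/zero of=/mnt/test_fs_without_sync/test.out bs=1M count=5000 oflag=sync

  # compare the client-side RPC counters before and after a run
  nfsstat -c

You should see the write-related RPC counters climb in step with the
number of 1M chunks written, rather than in a burst at the end.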

> 
> - In test number 1: why isn't there a change in cpuwait behavior when the
> vm.dirty tunables are changed? (I have tested a lot of combinations.)
> 
> 

Depends on which tunables you're twiddling, but you have 8G of RAM and
are writing a 5G file. All of that should fit in the pagecache without
needing to flush anything before all the writes are done. I imagine the
vm.dirty tunables don't really come into play in these tests, other than
maybe the background ones, and those shouldn't really affect your
buffered write throughput.
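
A quick way to sanity-check that (these are just the standard procfs and
sysctl interfaces, nothing NFS-specific) is to watch the dirty/writeback
counters while dd runs and to confirm which of the dirty tunables are
actually in effect:

  # how much dirty data accumulates and how quickly it is written back
  watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

  # only one of each bytes/ratio pair is active; the other reads 0
  sysctl vm.dirty_ratio vm.dirty_bytes vm.dirty_background_ratio vm.dirty_background_bytes

If Dirty never gets near the foreground limit during the run, then
changing those tunables won't change what dd (or the cpuwait numbers)
looks like.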
-- 
Jeff Layton <jlayton@xxxxxxxxxx>



