On Sun, 2024-05-05 at 13:29 +0200, Thomas Glanzmann wrote:
> Hello,
> I often take snapshots in order to move kvm VMs from one nfs share to
> another while they're running, or to take backups. Sometimes I have
> very large VMs (1.1 TB) which take a very long time (40 minutes - 2
> hours) to back up or move. They also write between 20 - 60 GB of data
> while being backed up or moved. Once the backup or move is done, the
> dirty snapshot data needs to be merged into the parent disk. While
> doing this I often experience I/O stalls within the VMs in the range
> of 1 - 20 seconds, sometimes worse. But I have some very
> latency-sensitive VMs which crash or misbehave after 15-second I/O
> stalls. So I would like to know if there is some tuning I can do to
> make these I/O stalls shorter.
>
> - I already tried to set vm.dirty_expire_centisecs=100, which appears
>   to make it better, but not under 15 seconds. Perfect would be I/O
>   stalls of no more than 1 second.
>
> This is how you can reproduce the issue:
>
> - NFS Server:
> mkdir /ssd
> apt install -y nfs-kernel-server
> echo '/ssd 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)' > /etc/exports
> exportfs -ra
>
> - NFS Client / KVM Host:
> mount server:/ssd /mnt
> # Put a VM on /mnt and start it.
> # Create a snapshot:
> virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata
>
> - In the VM:
>
> # Write some data (in my case 6 GB of data are written in 60 seconds
> # due to the nfs client being connected with a 1 Gbit/s link)
> fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=1 --readwrite=write --unlink=1
> # Do some synchronous I/O
> while true; do date | tee -a date.log; sync; sleep 1; done
>
> - On the NFS Client / KVM host:
> # Merge the snapshot into the parent disk
> time virsh blockcommit testy vda --active --pivot --delete
>
> Successfully pivoted
>
> real    1m4.666s
> user    0m0.017s
> sys     0m0.007s
>
> I exported the nfs share with sync on purpose because I often use
> drbd in sync mode (protocol c) to replicate the data on the nfs
> server to a site which is 200 km away using a 10 Gbit/s link.
>
> The result is:
> (testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
> Sun May 5 12:53:36 CEST 2024
> Sun May 5 12:53:37 CEST 2024
> Sun May 5 12:53:38 CEST 2024
> Sun May 5 12:53:39 CEST 2024
> Sun May 5 12:53:40 CEST 2024
> Sun May 5 12:53:41 CEST 2024 < here I started virsh blockcommit
> Sun May 5 12:53:45 CEST 2024
> Sun May 5 12:53:50 CEST 2024
> Sun May 5 12:53:59 CEST 2024
> Sun May 5 12:54:04 CEST 2024
> Sun May 5 12:54:22 CEST 2024
> Sun May 5 12:54:23 CEST 2024
> Sun May 5 12:54:27 CEST 2024
> Sun May 5 12:54:32 CEST 2024
> Sun May 5 12:54:40 CEST 2024
> Sun May 5 12:54:42 CEST 2024
> Sun May 5 12:54:45 CEST 2024
> Sun May 5 12:54:46 CEST 2024
> Sun May 5 12:54:47 CEST 2024
> Sun May 5 12:54:48 CEST 2024
> Sun May 5 12:54:49 CEST 2024
>
> This is with 'vm.dirty_expire_centisecs=100'; with the default value
> 'vm.dirty_expire_centisecs=3000' it is worse.
>
> I/O stalls:
> - 4 seconds
> - 9 seconds
> - 5 seconds
> - 18 seconds
> - 4 seconds
> - 5 seconds
> - 8 seconds
> - 2 seconds
> - 3 seconds
>
> With the default vm.dirty_expire_centisecs=3000 I get something like
> this:
>
> (testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
> Sun May 5 11:51:33 CEST 2024
> Sun May 5 11:51:34 CEST 2024
> Sun May 5 11:51:35 CEST 2024
> Sun May 5 11:51:37 CEST 2024
> Sun May 5 11:51:38 CEST 2024
> Sun May 5 11:51:39 CEST 2024
> Sun May 5 11:51:40 CEST 2024 << virsh blockcommit
> Sun May 5 11:51:49 CEST 2024
> Sun May 5 11:52:07 CEST 2024
> Sun May 5 11:52:08 CEST 2024
> Sun May 5 11:52:27 CEST 2024
> Sun May 5 11:52:45 CEST 2024
> Sun May 5 11:52:47 CEST 2024
> Sun May 5 11:52:48 CEST 2024
> Sun May 5 11:52:49 CEST 2024
>
> I/O stalls:
>
> - 9 seconds
> - 18 seconds
> - 19 seconds
> - 18 seconds
> - 1 second
>
> I'm open to any suggestions which improve the situation. I often have
> a 10 Gbit/s network and a lot of dirty buffer cache, but at the same
> time I often replicate synchronously to a second site 200 km apart,
> which only gives me around 100 MB/s write performance.
>
> With vm.dirty_expire_centisecs=10 it is even worse:
>
> (testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
> Sun May 5 13:25:31 CEST 2024
> Sun May 5 13:25:32 CEST 2024
> Sun May 5 13:25:33 CEST 2024
> Sun May 5 13:25:34 CEST 2024
> Sun May 5 13:25:35 CEST 2024
> Sun May 5 13:25:36 CEST 2024
> Sun May 5 13:25:37 CEST 2024 < virsh blockcommit
> Sun May 5 13:26:00 CEST 2024
> Sun May 5 13:26:01 CEST 2024
> Sun May 5 13:26:06 CEST 2024
> Sun May 5 13:26:11 CEST 2024
> Sun May 5 13:26:40 CEST 2024
> Sun May 5 13:26:42 CEST 2024
> Sun May 5 13:26:43 CEST 2024
> Sun May 5 13:26:44 CEST 2024
>
> I/O stalls:
>
> - 23 seconds
> - 5 seconds
> - 5 seconds
> - 29 seconds
> - 1 second
>
> Cheers,
> Thomas

Two suggestions:

1. Try mounting the NFS partition on which these VMs reside with the
"write=eager" mount option. That ensures that the kernel kicks off the
write of the block immediately once QEMU has scheduled it for
writeback. Note, however, that the kernel does not wait for that write
to complete (i.e. these writes are all asynchronous). A rough sketch
of the mount command follows at the end of this mail.

2. Alternatively, try playing with the 'vm.dirty_ratio' or
'vm.dirty_bytes' sysctls in order to trigger writeback at an earlier
time. With the default value of vm.dirty_ratio=20, you can end up
caching up to 20% of your total memory's worth of dirty data before
the VM triggers writeback over that 1 Gbit/s link. Again, an example
follows below.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
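
A minimal sketch of suggestion 1 (untested; "server:/ssd" and "/mnt"
are just the names from the reproduction steps above, and the "write="
mount option requires a reasonably recent NFS client kernel):

# Remount the share so that dirty pages are pushed to the server as
# soon as they are flushed, instead of piling up in the page cache.
umount /mnt
mount -t nfs -o write=eager server:/ssd /mnt

The equivalent /etc/fstab entry would be:

server:/ssd  /mnt  nfs  write=eager  0  0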
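
And a sketch of suggestion 2. The numbers here are only illustrative
assumptions, sized to roughly one second of the ~100 MB/s measured
towards the remote site; note that setting the *_bytes sysctls
automatically disables the corresponding *_ratio ones:

# Start background writeback once ~50 MB of dirty data has
# accumulated, and throttle writers hard at ~100 MB.
sysctl -w vm.dirty_background_bytes=52428800
sysctl -w vm.dirty_bytes=104857600

To make the settings persistent, put the same values into a file such
as /etc/sysctl.d/99-writeback.conf (a hypothetical name) and run
"sysctl --system". While the blockcommit runs, something like
"watch 'grep -e Dirty: -e Writeback: /proc/meminfo'" shows whether the
pile of dirty pages actually stays small.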