On 5 May 2024, at 7:29, Thomas Glanzmann wrote:

> Hello,
> I often take snapshots in order to move kvm VMs from one nfs share to
> another while they're running or to take backups. Sometimes I have very
> large VMs (1.1 TB) which take a very long time (40 minutes - 2 hours) to
> back up or move. They also write between 20 - 60 GB of data while being
> backed up or moved. Once the backup or move is done the dirty snapshot
> data needs to be merged to the parent disk. While doing this I often
> experience I/O stalls within the VMs in the range of 1 - 20 seconds.
> Sometimes worse. But I have some very latency-sensitive VMs which crash
> or misbehave after 15-second I/O stalls. So I would like to know if there
> is some tuning I can do to make these I/O stalls shorter.
>
> - I already tried to set vm.dirty_expire_centisecs=100 which appears to
>   make it better, but not under 15 seconds. Perfect would be I/O stalls
>   no more than 1 second.
>
> This is how you can reproduce the issue:
>
> - NFS Server:
>   mkdir /ssd
>   apt install -y nfs-kernel-server
>   echo '/ssd 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)' > /etc/exports
>   exportfs -ra
>
> - NFS Client / KVM Host:
>   mount server:/ssd /mnt
>   # Put a VM on /mnt and start it.
>   # Create a snapshot:
>   virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata

What NFS version ends up getting mounted here?

You might eliminate some head-of-line blocking issues with the
"nconnect=16" mount option, which opens additional TCP connections to
the server.

My view of what could be happening is that the IO from your guest's
process is contending with the IO from your 'virsh blockcommit' process,
and we don't currently have a great way to classify and queue IO from
various sources in various ways.

Ben
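
A minimal sketch of how the two suggestions above could be checked on the
KVM host, assuming a Linux client where the negotiated mount options show
up in /proc/mounts. The helper name nfs_mount_info is made up for
illustration; server:/ssd and /mnt are the paths from the reproduction
steps. As far as I know, nconnect needs a 5.3 or later client kernel, and
it only takes effect when the first connection to a server is set up, so
the share has to be unmounted before remounting with it.

```shell
#!/bin/sh
# Hypothetical helper: print the NFS version and nconnect value of every
# NFS mount. Reads /proc/mounts by default; another file can be passed in.
nfs_mount_info() {
    mounts_file="${1:-/proc/mounts}"
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk '$3 ~ /^nfs4?$/ {
        vers = "unknown"; nconnect = "unset"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] ~ /^vers=/)     vers = substr(opts[i], 6)
            if (opts[i] ~ /^nconnect=/) nconnect = substr(opts[i], 10)
        }
        printf "%s vers=%s nconnect=%s\n", $2, vers, nconnect
    }' "$mounts_file"
}

# Report on the live system if /proc/mounts is available.
if [ -r /proc/mounts ]; then
    nfs_mount_info
fi

# To try additional TCP connections (as root, with the VMs on the share
# stopped or moved elsewhere first):
#   umount /mnt
#   mount -t nfs -o nconnect=16 server:/ssd /mnt
```

If the remount took effect, the line for /mnt should show nconnect=16
alongside whatever NFS version was negotiated.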