Re: I/O stalls when merging qcow2 snapshots on nfs

Benjamin Coddington <bcodding@xxxxxxxxxx> · Mon, 06 May 2024 07:25:34 -0400

On 5 May 2024, at 7:29, Thomas Glanzmann wrote:

> Hello,
> I often take snapshots in order to move kvm VMs from one nfs share to
> another while they're running or to take backups. Sometimes I have very
> large VMs (1.1 TB) which take a very long time (40 minutes - 2 hours) to
> backup or move. They also write between 20 - 60 GB of data while being
> backed up or moved. Once the backup or move is done the dirty snapshot
> data needs to be merged to the parent disk. While doing this I often
> experience I/O stalls within the VMs in the range of 1 - 20 seconds.
> Sometimes worse. But I have some very latency sensitive VMs which crash
> or misbehave after 15 seconds I/O stalls. So I would like to know if there
> is some tuening I can do to make these I/O stalls shorter.
>
> - I already tried to set vm.dirty_expire_centisecs=100 which appears to
>   make it better, but not under 15 seconds. Perfect would be I/O stalls
>   no more than 1 second.
>
> This is how you can reproduce the issue:
>
> - NFS Server:
> mkdir /ssd
> apt install -y nfs-kernel-server
> echo '/nfs 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)' > /etc/exports
> exports -ra
>
> - NFS Client / KVM Host:
> mount server:/ssd /mnt
> # Put a VM on /mnt and start it.
> # Create a snapshot:
> virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata -no-metadata

What NFS version ends up getting mounted here?  You might eliminate some
head-of-line blocking issues with the "nconnect=16" mount option to open
additional TCP connections.

My view of what could be happening is that the IO from your guest's process
is congesting with the IO from your 'virsh blockcommit' process, and we
don't currently have a great way to classify and queue IO from various
sources in various ways.

Ben