Re: RBD snapshot atomicity guarantees?

On 18/12/2018 20:29, Oliver Freyermuth wrote:
Potentially, if you are granted arbitrary command execution via the guest agent, you could check (there might be a better interface than parsing meminfo...):
   cat /proc/meminfo | grep -i dirty
   Dirty:             19476 kB
You could guess from that information how long the fsfreeze may take (ideally, combining that with allowed IOPS).
Of course, if you have control over your VMs, you may also play with vm.dirty_ratio and vm.dirty_background_ratio.

I have that data (from node_exporter), but it looks like it was only a few kB, peaking at about 3MB during the problem interval. The problem is that there's no way to tell how long flushing that will take without knowing the average I/O size involved: 1MB of contiguous writes will complete in negligible time, while 1MB of 4kB random writes will take a few seconds. I do have access to the VMs; the customer stuff runs higher in the stack.

Still, the time it took to flush and the I/Os involved (it looks like ~23k IOs during the time range of interest) make me think there was something else at play beyond what the Dirty number accounts for. 23k IOs * 4kB (page size, worst case) is 94MB, which is definitely not what I had as Dirty. Perhaps it was dirty entries in the inode cache (which would explain the peak in Dirty, as they were flushed to disk buffers first and then to disk).
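
As a rough back-of-the-envelope check (the ~200 effective random-write IOPS figure here is just an assumption for 1 Gbps links and HDD OSDs, not something I measured):

   23,000 IOs / 200 IOPS ≈ 115 s   -> around two minutes to flush
   3 MB of Dirty as 4kB pages ≈ 750 IOs / 200 IOPS ≈ 4 s

So if those ~23k IOs are what fsfreeze had to wait on, minutes rather than seconds is about what you'd expect.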

Interestingly, tuned on CentOS 7 sets the following in its "virtual-guest" profile:
vm.dirty_ratio = 30
(the default is 20 %), so they optimize for performance by allowing more dirty data to accumulate and delaying writeback even further.
They take the opposite approach in their "virtual-host" profile:
vm.dirty_background_ratio = 5
(the default is 10 %).
I believe these choices are good for performance, but may increase the time it takes to freeze the VMs, especially if IOPS are limited and there's a lot of dirty data.

Yeah, I may need to try playing with some of those settings if this becomes a problem again in the future. FWIW, our hosts and VMs both run Ubuntu 16.04.
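
If I do end up tuning this, it would presumably just be a drop-in sysctl file lowering the dirty thresholds so there is less to write back when the freeze request arrives. Something like the following (the values are purely illustrative, not something I have tested on these VMs):

   # /etc/sysctl.d/90-dirty-writeback.conf  (illustrative values)
   vm.dirty_background_ratio = 5   # start background writeback earlier
   vm.dirty_ratio = 10             # cap dirty data before writers block
   # apply with: sysctl --system

The trade-off is the one you describe: less dirty data means faster freezes, at the cost of some write performance.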

Since we also have 1 Gbps links and HDD OSDs, and plan to add more and more VMs and hosts, we may also observe this one day...
So I'm curious:
How did you implement the timeout in your case? Are you using qemu-agent-command to issue the fsfreeze with --async and --timeout instead of domfsfreeze?
We are using domfsfreeze for now, which (probably) has an infinite timeout, or at least no timeout documented in the manpage.

We have a wrapper that takes the snapshots; it just uses domfsfreeze and kills the command if it takes too long. Unsurprisingly, that doesn't abort the freeze, so libvirt just ends up running it in the background (with a lock held, so domfsthaw doesn't work until the freeze completes).
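
Stripped down, the freeze side is roughly this shape (a simplified sketch, not the actual script; $DOMAIN and the 30 s timeout are placeholders):

   # freeze step, simplified (placeholder domain/timeout)
   if ! timeout 30 virsh domfsfreeze "$DOMAIN"; then
       echo "fsfreeze timed out or failed for $DOMAIN" >&2
       # The freeze may still complete later inside libvirt, so the
       # guest cannot be assumed to be unfrozen at this point.
   fi

I haven't tried the qemu-agent-command route you mention, but I'd expect it to look something like:

   virsh qemu-agent-command "$DOMAIN" \
       '{"execute": "guest-fsfreeze-freeze"}' --timeout 30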

The logic I have right now actually tries several times to thaw the filesystems, and if that doesn't succeed it resets the VM to avoid leaving it in a frozen state. However, I had a logic bug: if the freeze itself timed out, the wrapper skipped that step (assuming the VM wasn't frozen), when in this case the freeze was just taking a while. That leaves the VM frozen and broken. I'll probably add some alerting to complain loudly when this happens, increase the thaw timeout/retries, and then switch to unconditionally resetting the VM if thawing fails.
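
What I'm aiming for is roughly this shape (again a simplified sketch, with placeholder names and retry counts):

   # thaw-or-reset step, simplified (placeholder names/values)
   thawed=0
   for attempt in 1 2 3; do
       if timeout 30 virsh domfsthaw "$DOMAIN"; then
           thawed=1
           break
       fi
       sleep 10
   done
   if [ "$thawed" -ne 1 ]; then
       echo "thaw failed for $DOMAIN, resetting guest" >&2
       virsh reset "$DOMAIN"   # last resort: never leave the guest frozen
   fi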

Ultimately this whole thing is kind of fragile, so if I can get away without freezing at all, the whole process would probably be a lot more robust.

--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://marcan.st/marcan.asc