For what it's worth, we have been using snapshots on a daily basis for a
couple of thousand RBD volumes for some time.

So far so good; we have not caught any issues.

On 12/18/2018 10:28 AM, Oliver Freyermuth wrote:
> Dear Hector,
>
> we are using the very same approach on CentOS 7 (freeze + thaw), but
> preceded by an fstrim. With virtio-scsi, using fstrim propagates the
> discards from within the VM to Ceph RBD (if qemu is configured
> accordingly), and a lot of space is saved.
>
> We have yet to observe these hangs; we have been running this with
> ~5 VMs with ~10 disks for about half a year now, with daily snapshots.
> But all of these VMs have very "low" I/O, since we put anything I/O
> intensive on bare metal (but with automated provisioning, of course).
>
> So I'll chime in on your question, especially since there might be VMs
> on our cluster in the future where the inner OS may not be running an
> agent.
> Since we have not observed this yet, I'll also add: what's your "scale"?
> Is it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?
>
> Cheers,
> Oliver
>
> On 18.12.18 at 10:10, Hector Martin wrote:
>> Hi list,
>>
>> I'm running libvirt qemu guests on RBD, and currently taking backups
>> by issuing a domfsfreeze, taking a snapshot, and then issuing a
>> domfsthaw. This seems to be a common approach.
>>
>> This is safe, but it's impactful: the guest has frozen I/O for the
>> duration of the snapshot. This is usually only a few seconds.
>> Unfortunately, the freeze action doesn't seem to be very reliable.
>> Sometimes it times out, leaving the guest in a messy situation with
>> frozen I/O (thaw times out too when this happens, or returns success
>> but FSes end up frozen anyway). This is clearly a bug somewhere, but I
>> wonder whether the freeze is a hard requirement or not.
>>
>> Are there any atomicity guarantees for RBD snapshots taken *without*
>> freezing the filesystem? Obviously the filesystem will be dirty and
>> will require journal recovery, but that is okay; it's equivalent to a
>> hard shutdown/crash. But is there any chance of corruption related to
>> the snapshot being taken in a non-atomic fashion? Filesystems and
>> applications these days should have no trouble with hard shutdowns, as
>> long as storage writes follow ordering guarantees (no writes getting
>> reordered across a barrier and such).
>>
>> Put another way: do RBD snapshots have ~identical atomicity guarantees
>> to e.g. LVM snapshots?
>>
>> If we can get away without the freeze, honestly I'd rather go that
>> route. If I really need to pause I/O during the snapshot creation, I
>> might end up resorting to pausing the whole VM (suspend/resume), which
>> has higher impact but also probably a much lower chance of messing up
>> (or having excess latency), since it doesn't involve the guest OS or
>> the qemu agent at all...
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
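
To make the workflow discussed in this thread concrete, here is a minimal
sketch in Python of the trim, freeze, snapshot, thaw sequence. It is an
illustration only, not anyone's actual backup script: the function name
`snapshot_with_freeze`, the injectable `run` parameter, and the example
domain and image names are assumptions made for the sketch. It shells out
to `virsh domfstrim`, `virsh domfsfreeze`, `virsh domfsthaw` and
`rbd snap create`, and always attempts the thaw, since a timed-out freeze
is reported above to leave guest I/O frozen.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and raises on failure.
Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def snapshot_with_freeze(
    domain: str,
    rbd_spec: str,
    snap_name: str,
    run: Runner = run_checked,
    trim: bool = False,
) -> None:
    """Take an RBD snapshot while the guest filesystems are frozen.

    domain    -- libvirt domain name (hypothetical example: "vm1")
    rbd_spec  -- pool/image of the guest disk (hypothetical: "rbd/vm1-disk0")
    snap_name -- name for the new snapshot
    trim      -- optionally issue an fstrim through the guest agent first
    """
    if trim:
        # Discards only reach Ceph if qemu is configured to pass them on.
        run(["virsh", "domfstrim", domain])
    try:
        run(["virsh", "domfsfreeze", domain])
        run(["rbd", "snap", "create", f"{rbd_spec}@{snap_name}"])
    finally:
        # Attempt the thaw even if the freeze or the snapshot failed,
        # so a failed run does not knowingly leave the guest frozen.
        run(["virsh", "domfsthaw", domain])
```

A call would look like `snapshot_with_freeze("vm1", "rbd/vm1-disk0",
"daily", trim=True)`. Passing a different `run` callable makes it possible
to log or dry-run the commands. Note that this does not address the case
described above where the thaw itself times out or silently fails; that
would still need separate monitoring.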