Hi everyone,

I'm trying to wrap my head around an issue we recently saw, as it relates to RBD locks, Qemu/KVM, and libvirt. Our data center graced us with a sudden and complete dual-feed power failure that affected both a Ceph cluster (Luminous, 12.2.12) and the OpenStack compute nodes that used RBDs in that cluster. (Yes, these things really happen, even in 2019.)

Once nodes were powered back up, the Ceph cluster came up gracefully with no intervention required; all we saw was some Mon clock skew until the NTP peers had fully synced. Yay!

Our Nova compute nodes, however, or rather the libvirt VMs running on them, were in not so great a shape. The VMs booted up fine initially, but then blew up as soon as they tried to write to their RBD-backed virtio devices, which, of course, happened very early in the boot sequence since they had dirty filesystem journals to apply.

Being able to read from, but not write to, RBDs is usually an issue with exclusive locking, so we stopped one of the affected VMs and checked the RBD locks on its device. We found (with "rbd lock ls") that the lock was still being held even after the VM was definitely down; both "openstack server show" and "virsh domstate" agreed on this. We manually cleared the lock ("rbd lock rm"), started the VM, and it booted up fine. Repeat for all VMs, and we were back in business.

If I understand correctly, image locks, in contrast to image watchers, have no timeout, so locks must always be explicitly released or they linger forever. That raises a few questions:

(1) Is it correct to assume that the lingering lock was actually from *before* the power failure?

(2) What, exactly, triggers lock acquisition and release in this context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?

(3) Would the same issue be expected in essentially any hard failure of even a single compute node? If so, does that mean that what https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova evacuate" (and presumably, by extension, about "nova host-evacuate") is inaccurate? And if so, what would be required to make that work?

(4) If (3), is it correct to assume that the same considerations apply to the Nova resume_guests_state_on_host_boot feature, i.e. that automatic guest recovery wouldn't be expected to succeed even after just a hard reboot of a node, as opposed to a catastrophic permanent failure? Again, what would be required to make that work? Is it really necessary to clean up all RBD locks manually?

Grateful for any insight that people can share here. I'd volunteer to add a brief writeup of locking functionality in this context to the docs.

Thanks!

Cheers,
Florian
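
P.S. In case it helps anyone hitting the same thing, here is the per-VM recovery sequence we used, sketched as shell commands. The pool name, image name, lock ID, and client name below are made-up placeholders, and the sample "rbd lock ls" output is only illustrative, so adjust everything to your own environment:

    # 1. Confirm the VM is really down, at both the Nova and libvirt layers
    openstack server show <uuid> -c status      # expect SHUTOFF
    virsh domstate <instance-name>              # expect "shut off"

    # 2. List any lingering locks on the VM's RBD image
    rbd lock ls volumes/volume-1234abcd
    #   Locker        ID                      Address
    #   client.4156   auto 140339841909760    192.168.0.10:0/1029591143

    # 3. Remove the stale lock; the two-word lock ID needs quoting
    rbd lock rm volumes/volume-1234abcd "auto 140339841909760" client.4156

    # 4. Start the VM again
    openstack server start <uuid>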
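
P.P.S. Doing that image by image got tedious fast, so a quick-and-dirty sweep for every image in a pool that still holds a lock might look like the snippet below. This assumes that plain "rbd lock ls" prints nothing at all for an unlocked image, which matches my understanding on Luminous but is worth verifying before relying on it:

    # Hypothetical sweep over all images in the "volumes" pool,
    # printing the lock table for any image that still holds a lock
    for img in $(rbd ls volumes); do
        locks=$(rbd lock ls "volumes/$img")
        if [ -n "$locks" ]; then
            echo "=== volumes/$img ==="
            echo "$locks"
        fi
    done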