Hi everyone,

I'm trying to wrap my head around an issue we recently saw, as it relates to RBD locks, Qemu/KVM, and libvirt. Our data center graced us with a sudden and complete dual-feed power failure that affected both a Ceph cluster (Luminous, 12.2.12) and the OpenStack compute nodes that used RBDs in that cluster. (Yes, these things really happen, even in 2019.)

Once nodes were powered back up, the Ceph cluster came up gracefully with no intervention required; all we saw was some Mon clock skew until the NTP peers had fully synced. Yay!

Our Nova compute nodes, however, or rather the libvirt VMs running on them, were in not so great a shape. The VMs booted up fine initially, but then blew up as soon as they tried to write to their RBD-backed virtio devices, which, of course, happened very early in the boot sequence since they had dirty filesystem journals to apply.

Being able to read from, but not write to, RBDs is usually an issue with exclusive locking, so we stopped one of the affected VMs and checked the RBD locks on its device. We found (with "rbd lock ls") that the lock was still being held even after the VM was definitely down; both "openstack server show" and "virsh domstate" agreed on this. We manually cleared the lock ("rbd lock rm"), started the VM, and it booted up fine. Repeat for all VMs, and we were back in business.

If I understand correctly, image locks, in contrast to image watchers, have no timeout, so locks must always be explicitly released or they linger forever. That raises a few questions:

(1) Is it correct to assume that the lingering lock was actually from *before* the power failure?

(2) What, exactly, triggers lock acquisition and release in this context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?

(3) Would the same issue be expected in essentially any hard failure of even a single compute node? If so, does that mean that what https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova evacuate" (and presumably, by extension, about "nova host-evacuate") is inaccurate? And if so, what would be required to make that work?

(4) If (3), is it correct to assume that the same considerations apply to the Nova resume_guests_state_on_host_boot feature, i.e. that automatic guest recovery wouldn't be expected to succeed even after just a hard reboot of a node, as opposed to a catastrophic permanent failure? Again, what would be required to make that work? Is it really necessary to clean up all RBD locks manually?

Grateful for any insight that people can share here. I'd volunteer to add a brief writeup of locking functionality in this context to the docs.

Thanks!

Cheers,
Florian
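
P.S. In case it helps anyone hitting the same thing, here is the per-VM recovery sequence we used, sketched as shell commands. The pool name, image name, lock ID, and client name below are made-up placeholders, and the sample "rbd lock ls" output is only illustrative, so adjust everything to your own environment:

    # 1. Confirm the VM is really down, at both the Nova and libvirt layers
    openstack server show <uuid> -c status      # expect SHUTOFF
    virsh domstate <instance-name>              # expect "shut off"

    # 2. List any lingering locks on the VM's RBD image
    rbd lock ls volumes/volume-1234abcd
    #   Locker        ID                      Address
    #   client.4156   auto 140339841909760    192.168.0.10:0/1029591143

    # 3. Remove the stale lock; the two-word lock ID needs quoting
    rbd lock rm volumes/volume-1234abcd "auto 140339841909760" client.4156

    # 4. Start the VM again
    openstack server start <uuid>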
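
P.P.S. Doing that image by image got tedious fast, so a quick-and-dirty sweep for every image in a pool that still holds a lock might look like the snippet below. This assumes that plain "rbd lock ls" prints nothing at all for an unlocked image, which matches my understanding on Luminous but is worth verifying before relying on it:

    # Hypothetical sweep over all images in the "volumes" pool,
    # printing the lock table for any image that still holds a lock
    for img in $(rbd ls volumes); do
        locks=$(rbd lock ls "volumes/$img")
        if [ -n "$locks" ]; then
            echo "=== volumes/$img ==="
            echo "$locks"
        fi
    done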