On 11/15/19 11:24 AM, Simon Ironside wrote:
> Hi Florian,
>
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
>

Adding to this, I recommend using 'profile rbd' for the mon caps, as
also stated in the Ceph docs for OpenStack:

https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication

(A sketch of the corresponding commands is appended at the end of this
message.)

Wido

> Simon
>
> On 15/11/2019 08:19, Florian Haas wrote:
>> Hi everyone,
>>
>> I'm trying to wrap my head around an issue we recently saw, as it
>> relates to RBD locks, Qemu/KVM, and libvirt.
>>
>> Our data center graced us with a sudden and complete dual-feed power
>> failure that affected both a Ceph cluster (Luminous, 12.2.12) and
>> OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes,
>> these things really happen, even in 2019.)
>>
>> Once nodes were powered back up, the Ceph cluster came up gracefully
>> with no intervention required — all we saw was some Mon clock skew
>> until NTP peers had fully synced. Yay! However, our Nova compute
>> nodes, or rather the libvirt VMs that were running on them, were in
>> not so great shape. The VMs booted up fine initially, but then blew
>> up as soon as they tried to write to their RBD-backed virtio
>> devices — which, of course, was very early in the boot sequence, as
>> they had dirty filesystem journals to apply.
>>
>> Being able to read from, but not write to, RBDs is usually an issue
>> with exclusive locking, so we stopped one of the affected VMs,
>> checked the RBD locks on its device, and found (with rbd lock ls)
>> that the lock was still being held even after the VM was definitely
>> down — both "openstack server show" and "virsh domstate" agreed on
>> this. We manually cleared the lock (rbd lock rm), started the VM,
>> and it booted up fine.
>>
>> Repeat for all VMs, and we were back in business.
>>
>> If I understand correctly, image locks — in contrast to image
>> watchers — have no timeout, so locks must always be explicitly
>> released, or they linger forever.
>>
>> So that raises a few questions:
>>
>> (1) Is it correct to assume that the lingering lock was actually from
>> *before* the power failure?
>>
>> (2) What, exactly, triggers the lock acquisition and release in this
>> context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?
>>
>> (3) Would the same issue be expected in essentially any hard failure
>> of even a single compute node, and if so, does that mean that what
>> https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
>> evacuate" (and presumably, by extension, also about "nova
>> host-evacuate") is inaccurate? If so, what would be required to make
>> that work?
>>
>> (4) If (3), is it correct to assume that the same considerations
>> apply to the Nova resume_guests_state_on_host_boot feature, i.e. that
>> automatic guest recovery wouldn't be expected to succeed even if a
>> node experienced just a hard reboot, as opposed to a catastrophic
>> permanent failure? And again, what would be required to make that
>> work? Is it really necessary to clear all RBD locks manually?
>>
>> Grateful for any insight that people could share here. I'd volunteer
>> to add a brief writeup of locking functionality in this context to
>> the docs.
>>
>> Thanks!
>>
>> Cheers,
>> Florian
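
For reference, here is a rough sketch of the commands discussed above.
The client name (client.cinder) and the pool names (volumes, vms,
images) are only the defaults from the linked documentation, not
anything confirmed in this thread, and the image and lock identifiers
are placeholders; adjust all of them to your deployment.

    # Inspect the caps currently assigned to the key the compute nodes
    # use (client.cinder is an assumption; substitute your client name).
    ceph auth get client.cinder

    # Wido's recommendation, per the linked docs: switch the mon cap to
    # 'profile rbd' and the osd caps to the rbd profiles.
    ceph auth caps client.cinder \
        mon 'profile rbd' \
        osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'

    # Manual lock cleanup as Florian describes: list the lock on the
    # image, then remove it by the lock ID and locker shown by 'lock ls'.
    rbd lock ls volumes/<image>
    rbd lock rm volumes/<image> 'auto <lock-id>' client.<id>

The point of 'profile rbd' on the mon caps is that it includes the
'osd blacklist' permission Simon mentions, which is what lets a
surviving client blacklist a dead lock holder and take over the
exclusive lock, instead of someone having to run 'rbd lock rm' by hand.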