On 11/15/19 11:24 AM, Simon Ironside wrote:
> Hi Florian,
>
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
>

Adding to this, I recommend using 'profile rbd' for the mon caps, as
also stated in the Ceph docs for OpenStack:

https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication

(A sketch of the corresponding commands is appended at the end of this
message.)

Wido

> Simon
>
> On 15/11/2019 08:19, Florian Haas wrote:
>> Hi everyone,
>>
>> I'm trying to wrap my head around an issue we recently saw, as it
>> relates to RBD locks, Qemu/KVM, and libvirt.
>>
>> Our data center graced us with a sudden and complete dual-feed power
>> failure that affected both a Ceph cluster (Luminous, 12.2.12) and
>> OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes,
>> these things really happen, even in 2019.)
>>
>> Once nodes were powered back up, the Ceph cluster came up gracefully
>> with no intervention required — all we saw was some Mon clock skew
>> until NTP peers had fully synced. Yay! However, our Nova compute
>> nodes, or rather the libvirt VMs that were running on them, were in
>> not so great shape. The VMs booted up fine initially, but then blew
>> up as soon as they tried to write to their RBD-backed virtio
>> devices — which, of course, was very early in the boot sequence, as
>> they had dirty filesystem journals to apply.
>>
>> Being able to read from, but not write to, RBDs is usually an issue
>> with exclusive locking, so we stopped one of the affected VMs,
>> checked the RBD locks on its device, and found (with rbd lock ls)
>> that the lock was still being held even after the VM was definitely
>> down — both "openstack server show" and "virsh domstate" agreed on
>> this. We manually cleared the lock (rbd lock rm), started the VM,
>> and it booted up fine.
>>
>> Repeat for all VMs, and we were back in business.
>>
>> If I understand correctly, image locks — in contrast to image
>> watchers — have no timeout, so locks must always be explicitly
>> released, or they linger forever.
>>
>> So that raises a few questions:
>>
>> (1) Is it correct to assume that the lingering lock was actually from
>> *before* the power failure?
>>
>> (2) What, exactly, triggers the lock acquisition and release in this
>> context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?
>>
>> (3) Would the same issue be expected in essentially any hard failure
>> of even a single compute node, and if so, does that mean that what
>> https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
>> evacuate" (and presumably, by extension, also about "nova
>> host-evacuate") is inaccurate? If so, what would be required to make
>> that work?
>>
>> (4) If (3), is it correct to assume that the same considerations
>> apply to the Nova resume_guests_state_on_host_boot feature, i.e. that
>> automatic guest recovery wouldn't be expected to succeed even if a
>> node experienced just a hard reboot, as opposed to a catastrophic
>> permanent failure? And again, what would be required to make that
>> work? Is it really necessary to clear all RBD locks manually?
>>
>> Grateful for any insight that people could share here. I'd volunteer
>> to add a brief writeup of locking functionality in this context to
>> the docs.
>>
>> Thanks!
>>
>> Cheers,
>> Florian
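
For reference, here is a rough sketch of the commands discussed above.
The client name (client.cinder) and the pool names (volumes, vms,
images) are only the defaults from the linked documentation, not
anything confirmed in this thread, and the image and lock identifiers
are placeholders; adjust all of them to your deployment.

    # Inspect the caps currently assigned to the key the compute nodes
    # use (client.cinder is an assumption; substitute your client name).
    ceph auth get client.cinder

    # Wido's recommendation, per the linked docs: switch the mon cap to
    # 'profile rbd' and the osd caps to the rbd profiles.
    ceph auth caps client.cinder \
        mon 'profile rbd' \
        osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'

    # Manual lock cleanup as Florian describes: list the lock on the
    # image, then remove it by the lock ID and locker shown by 'lock ls'.
    rbd lock ls volumes/<image>
    rbd lock rm volumes/<image> 'auto <lock-id>' client.<id>

The point of 'profile rbd' on the mon caps is that it includes the
'osd blacklist' permission Simon mentions, which is what lets a
surviving client blacklist a dead lock holder and take over the
exclusive lock, instead of someone having to run 'rbd lock rm' by hand.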