On Tue, Nov 19, 2019 at 4:09 PM Florian Haas <florian@xxxxxxxxxxxxxx> wrote:
>
> On 19/11/2019 21:32, Jason Dillaman wrote:
> >> What, exactly, is the "reasonably configured hypervisor" here, in other
> >> words, what is it that grabs and releases this lock? It's evidently not
> >> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> >> magic in there makes this happen, and what "reasonable configuration"
> >> influences this?
> >
> > librbd and krbd perform this logic when the exclusive-lock feature is
> > enabled.
>
> Right. So the "reasonable configuration" applies to the features they
> enable when they *create* an image, rather than what they do to the
> image at runtime. Is that fair to say?

Exclusive-lock ownership is enforced whenever the image is used (i.e. the
feature is a persistent property of the image, not something that only
matters at the moment the property is enabled) -- so yes, it implies "what
they do to the image at runtime".

> > In this case, librbd sees that the previous lock owner is
> > dead / missing, but before it can steal the lock (since librbd did not
> > cleanly close the image), it needs to ensure it cannot come back from
> > the dead to issue future writes against the RBD image by blacklisting
> > it from the cluster.
>
> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
> makes perfect sense to me when I want to fence a whole node off --
> however, how exactly does this work with VM recovery in place?

How would librbd / krbd know under what circumstances a VM was being
"recovered"? Should librbd be expected to integrate with the IPMI devices
of the host where the VM runs, or with Zabbix alert monitoring, to learn
that this was a power failure and therefore not expect the lock owner to
come back up? The safe and generic thing for librbd / krbd to do in this
situation is to just blacklist the old lock owner to ensure it cannot talk
to the cluster.
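To make the takeover sequence concrete, here is a minimal sketch of the decision logic described above. This is emphatically *not* librbd's actual implementation; the `Cluster` class, the address strings, and the function names are hypothetical stand-ins for the real cephx/watch machinery. The key invariant it illustrates: a dead owner is blacklisted *before* its lock is broken.

```python
# Toy model of exclusive-lock takeover. All names here are illustrative,
# not Ceph APIs.

class Cluster:
    """Hypothetical stand-in for a Ceph cluster's lock/blacklist state."""
    def __init__(self):
        self.lock_owner = None      # client address currently holding the lock
        self.watchers = set()       # clients with a live watch (i.e. alive)
        self.blacklist = set()      # fenced client addresses

    def owner_is_alive(self):
        return self.lock_owner in self.watchers

def acquire_exclusive_lock(cluster, me):
    """Acquire the lock as client `me`, fencing a dead previous owner."""
    old = cluster.lock_owner
    if old is not None and old != me:
        if cluster.owner_is_alive():
            # Cooperative case: request release from the live owner
            # (request/release protocol not modeled here).
            return False
        # Old owner is dead or missing: blacklist it *before* breaking
        # the lock, so it can never come back and issue stale writes.
        cluster.blacklist.add(old)
        cluster.lock_owner = None
    cluster.lock_owner = me
    return True
```

For example, after a hard reboot the old client instance (same IP, different nonce) holds the lock but has no live watch, so the restarted client fences it and takes over without operator intervention.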
Obviously, in the case of a physically failed node, that won't ever
happen -- but I think we can all agree this is the sane recovery path
that covers all bases.

> From further upthread:
>
> > Semi-relatedly, as I understand it OSD blacklisting happens based either
> > on an IP address, or on a socket address (IP:port). While this comes in
> > handy in host evacuation, it doesn't help with in-place recovery (see
> > question 4 in my original message).
> >
> > - If the blacklist happens based on IP address alone (and that's what
> > the client appears to be attempting, based on our log messages), then
> > it would break recovery-in-place after a hard reboot altogether.
> >
> > - Even if the client blacklisted based on an address:port pair, it
> > would be very unlikely -- but not impossible -- that an RBD client used
> > the same source port to connect after the node recovers in place.
>
> Clearly though, if people set their permissions correctly then this
> blacklisting seems to work fine even for recovery-in-place, so no reason
> for me to doubt that, I'd just really like to understand the mechanics. :)

Yup, with the correct permissions librbd / krbd will be able to blacklist
the old lock owner, break the old lock, and acquire the lock itself for
R/W operations -- and the operator will not need to intervene.

> Thanks again!
>
> Cheers,
> Florian

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
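For reference, the "correct permissions" boil down to the client's cephx caps including the blacklist permission, which the `profile rbd` mon capability provides. A sketch of what that might look like (the client name `client.libvirt` and pool name `vms` are placeholders for your own deployment):

```shell
# 'profile rbd' on the mon cap includes the 'osd blacklist' permission
# that lets the client fence a dead lock owner on its own:
ceph auth caps client.libvirt \
    mon 'profile rbd' \
    osd 'profile rbd pool=vms'

# Inspecting the lock and watcher state manually, if ever needed:
rbd lock list vms/instance-disk     # shows the current lock owner
rbd status vms/instance-disk        # shows live watchers on the image
```

With caps like these, the blacklist/break/acquire sequence happens automatically inside librbd; the commands above are only for inspection or manual intervention.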