Hi all, we are observing a problem on a libvirt virtualisation cluster that might be caused by the ceph RBD clients. Something went wrong during a live-migration operation and, as a result, we now have two instances of the same VM running on two different hosts, the source and the destination host. What we observe is that the exclusive lock on the RBD disk image moves between these two clients periodically (the owner flips every few minutes). We are quite sure that no virsh commands that could have this effect are executed during this time. The client connections are not lost and the OSD blacklist is empty. I don't understand why a ceph RBD client would surrender an exclusive lock in such a split-brain situation; that is exactly when it needs to hold on to it. As a result, the affected virtual drives are corrupted.

The questions we have in this context are:

- Under what conditions does a ceph RBD client surrender an exclusive lock?
- Could this be a bug in the client or a ceph config error?
- Is this a known problem with libceph and libvirtd?
- Is anyone else making the same observation, and does anyone have some guidance?

The VM hosts are on Alma 8 and we use the advanced virtualisation repo, which provides very recent versions of qemu and libvirtd. We have seen this floating exclusive lock before on mimic. Now we are on octopus, so I can't really blame it on the old ceph version any more. We use OpenNebula as a KVM front-end.

Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx