We experienced this problem in the past on older (pre-Jewel) releases, where a PG split that affected the RBD header object would result in the watch getting lost by librados. Any chance you know whether the affected RBD header objects were involved in a PG split? Can you generate a gcore dump of one of the affected VMs and ceph-post-file it for analysis?

As for the VM going R/O: that is the expected behavior when a client breaks the exclusive lock held by a (dead) client.

On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> On an OpenStack environment I encountered a VM which went into R/O mode after an RBD snapshot was created.
>
> Digging into this, I found tens (out of thousands) of RBD images which DO have a running VM, but do NOT have a watcher on the RBD image.
>
> For example:
>
> $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
>
> 'Watchers: none'
>
> The VM has, however, been running since September 5th 2017 with Jewel 10.2.7 on the client.
>
> In the meantime the cluster was already upgraded to 10.2.10.
>
> Looking further, I also found a compute node with 10.2.10 installed which also has RBD images without watchers.
>
> Restarting or live-migrating the VM to a different host resolves the issue.
>
> The internet is full of posts where RBD images still have watchers when people don't expect them, but in this case I'm expecting a watcher which isn't there.
>
> The main problem right now is that creating a snapshot can put a VM into a read-only state because of the missing notification.
>
> Has anybody seen this as well?
>
> Thanks,
>
> Wido

--
Jason
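
For reference, a minimal sketch of the gcore / ceph-post-file steps mentioned above; the pgrep pattern, upload description, and output path are assumptions to adapt to the host (the volume name typically appears on the QEMU command line when the disk is attached via librbd):

$ PID=$(pgrep -f 'qemu.*volume-79773f2e' | head -n1)   # QEMU process of the affected VM (match pattern is an assumption)
$ gcore -o /tmp/qemu-core "$PID"                       # needs gdb; briefly pauses the process and writes /tmp/qemu-core.<pid>
$ ceph-post-file -d 'qemu core, RBD image without watcher' /tmp/qemu-core."$PID"

ceph-post-file prints a tag that can be shared on the list or in a tracker ticket so the developers can find the upload.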
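As a cross-check next to rbd status, the watch can also be queried directly on the image's header object (this assumes a format 2 image; <id> stands for the id shown after "rbd_data." in the block_name_prefix, which is not filled in here):

$ rbd info volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086 | grep block_name_prefix
$ rados -p volumes listwatchers rbd_header.<id>        # empty output matches the 'Watchers: none' that rbd status reports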