Re: RBD image has no active watchers while OpenStack KVM VM is running

> On 30 November 2017 at 14:19, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> 
> 
> On Thu, Nov 30, 2017 at 4:00 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> >> On 29 November 2017 at 14:56, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> >>
> >>
> >> We experienced this problem in the past on older (pre-Jewel) releases
> >> where a PG split that affected the RBD header object would result in
> >> the watch getting lost by librados. Any chance you know if the
> >> affected RBD header objects were involved in a PG split? Can you
> >> generate a gcore dump of one of the affected VMs and ceph-post-file it
> >> for analysis?
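> >>
> >> In case it helps, a sketch of one way to locate the header object
> >> and its PG, assuming a format 2 image (the <id> placeholder is
> >> whatever block_name_prefix reports after "rbd_data."):
> >>
> >> $ rbd info volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086 | grep block_name_prefix
> >> $ ceph osd map volumes rbd_header.<id>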
> >>
> >
> > There has been no PG splitting on this cluster in recent months, so that's not what happened here.
> 
> Possible alternative explanation: are you using cache tiering?

No, no cache tiering either. It's running 3x replication, standard RBD behind OpenStack.

The cluster has around 2,000 OSDs, all with 4 TB disks, and 3x replication.

I'll wait for the gcore dump of a running VM, but that may take a few days.
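
For reference, this is roughly what they will need to run (a minimal sketch; the QEMU process name is an assumption and may be qemu-kvm or similar depending on the distribution):

$ gcore -o /tmp/qemu-vm $(pidof qemu-system-x86_64)  # core dump of the running VM process
$ ceph-post-file /tmp/qemu-vm.<pid>                  # upload the resulting file for analysis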

Wido

> 
> > I've asked the OpenStack team for a gcore dump, but they have to get that cleared before they can send it to me.
> >
> > This might take a bit of time!
> >
> > Wido
> >
> >> As for the VM going R/O, that is the expected behavior when a client
> >> breaks the exclusive lock held by a (dead) client.
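> >>
> >> A sketch of one way to inspect this: 'rbd lock list' shows the
> >> current holder, and with the exclusive-lock feature enabled the
> >> managed lock typically appears with an "auto ..." lock ID.
> >>
> >> $ rbd lock list volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086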
> >>
> >> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> > Hi,
> >> >
> >> > In an OpenStack environment I encountered a VM which went into R/O mode after an RBD snapshot was created.
> >> >
> >> > Digging into this, I found dozens (out of thousands) of RBD images which DO have a running VM but do NOT have a watcher on the RBD image.
> >> >
> >> > For example:
> >> >
> >> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
> >> >
> >> > 'Watchers: none'
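> >> >
> >> > The same can be checked directly on the header object (a sketch;
> >> > <id> is taken from block_name_prefix in 'rbd info', assuming a
> >> > format 2 image):
> >> >
> >> > $ rados -p volumes listwatchers rbd_header.<id>
> >> >
> >> > which likewise lists no watchers for these images.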
> >> >
> >> > The VM has, however, been running since September 5th, 2017, with Jewel 10.2.7 on the client.
> >> >
> >> > In the meantime the cluster was upgraded to 10.2.10.
> >> >
> >> > Looking further, I also found a compute node with 10.2.10 installed which also has RBD images without watchers.
> >> >
> >> > Restarting the VM or live-migrating it to a different host resolves the issue.
> >> >
> >> > The internet is full of posts where RBD images still have watchers when people don't expect them, but in this case I'm expecting a watcher which isn't there.
> >> >
> >> > The main problem right now is that creating a snapshot can put a VM into a read-only state: without a watch there is no notification channel, so the snapshot ends up breaking the exclusive lock of a client that is in fact still alive.
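> >> >
> >> > For example, something as simple as (snapshot name hypothetical):
> >> >
> >> > $ rbd snap create volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086@snap1
> >> >
> >> > is enough to trigger it on an affected image.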
> >> >
> >> > Has anybody seen this as well?
> >> >
> >> > Thanks,
> >> >
> >> > Wido
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users@xxxxxxxxxxxxxx
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
> 
> 
> 
> -- 
> Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


