> On 5 December 2017 at 15:27, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
>
> On Tue, Dec 5, 2017 at 9:13 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> >> On 29 November 2017 at 14:56, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> >>
> >>
> >> We experienced this problem in the past on older (pre-Jewel) releases
> >> where a PG split that affected the RBD header object would result in
> >> the watch getting lost by librados. Any chance you know if the
> >> affected RBD header objects were involved in a PG split? Can you
> >> generate a gcore dump of one of the affected VMs and ceph-post-file it
> >> for analysis?
> >>
> >
> > I asked again for the gcore dump, but they can't release it as it contains confidential information about the instance and the Ceph cluster. I understand their reasoning, and they also understand that this makes it difficult to debug.
> >
> > I am allowed to look at the gcore dump when on location (next week), but I'm not allowed to share it.
>
> Indeed -- best chance would be if you could reproduce on a VM that you
> are permitted to share.
>

We are looking into that.

> >> As for the VM going R/O, that is the expected behavior when a client
> >> breaks the exclusive lock held by a (dead) client.
> >>
> >
> > We noticed another VM going R/O when a snapshot was created. When we checked last week this instance had a watcher, but after the snapshot (and the resulting R/O) it no longer had a watcher registered.
> >
> > Any suggestions or ideas?
>
> If you have the admin socket enabled, you could run "ceph
> --admin-daemon /path/to/asok objecter_requests" to dump the ops. That
> probably won't be useful unless there is a smoking gun. Did you have
> any OSDs go out/down? Network issues?
>

The admin socket is currently not enabled, but I will ask them to enable it. We will then have to wait for this to happen again.

We didn't have any network issues there, but a few OSDs went down and came back up in the last few weeks, though not very recently afaik.

I'll look into the admin socket!

Wido

> > Wido
> >
> >> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> > Hi,
> >> >
> >> > In an OpenStack environment I encountered a VM which went into R/O mode after an RBD snapshot was created.
> >> >
> >> > Digging into this I found tens of RBD images (out of thousands) which DO have a running VM, but do NOT have a watcher on the RBD image.
> >> >
> >> > For example:
> >> >
> >> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
> >> >
> >> > 'Watchers: none'
> >> >
> >> > The VM has, however, been running since September 5th, 2017, with Jewel 10.2.7 on the client.
> >> >
> >> > In the meantime the cluster was already upgraded to 10.2.10.
> >> >
> >> > Looking further I also found a compute node with 10.2.10 installed which also has RBD images without watchers.
> >> >
> >> > Restarting or live migrating the VM to a different host resolves the issue.
> >> >
> >> > The internet is full of posts about RBD images that still have watchers when people don't expect them, but in this case I'm expecting a watcher which isn't there.
> >> >
> >> > The main problem right now is that creating a snapshot can potentially put a VM into a read-only state because of the lack of notification.
> >> >
> >> > Has anybody seen this as well?
> >> >
> >> > Thanks,
> >> >
> >> > Wido
> >>
> >>
> >>
> >> --
> >> Jason
> >
>
> --
> Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
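
As a rough sketch of the admin-socket route discussed above (the client name, socket path and image id below are illustrative placeholders, not values from this cluster), enabling the socket on a compute node and checking the watch could look roughly like this:

  # ceph.conf on the compute node; librados creates one socket per client
  # instance, but only when the client (re)starts, so the VM needs a
  # restart or live migration to pick this up:
  [client]
      admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

  # Dump the in-flight objecter ops, as suggested above:
  $ ceph --admin-daemon /var/run/ceph/ceph-client.cinder.12345.140093.asok objecter_requests

  # A lower-level check for the watch is to query the RBD header object
  # directly (format 2 images); the image id is part of the
  # block_name_prefix reported by 'rbd info':
  $ rbd info volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086 | grep block_name_prefix
  $ rados -p volumes listwatchers rbd_header.<image_id>

If the watch is healthy, listwatchers should report the client's address and a cookie; an empty result matches the 'Watchers: none' output from rbd status.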