Re: RBD Stuck Watcher

Hi,
It sounds similar. What would be the best way to confirm it? Logs? If so,
which log/message should I look for?
Thanks

On Thu, Jul 25, 2024 at 6:11 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

> On Wed, Jul 3, 2024 at 5:45 PM Reid Guyett <reid.guyett@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I have a small script in a Docker container we use for a type of CRUD
> > test to monitor availability. The script uses Python librbd/librados and
> > is launched by Telegraf input.exec. It does the following:
> >
> >    1. Creates an rbd image
> >    2. Writes a small amount of data to the rbd
> >    3. Reads the data from the rbd
> >    4. Deletes the rbd
> >    5. Closes connections
> >
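(For context, a stripped-down sketch of the five steps above, using the
standard rados/rbd Python bindings; the pool/image names, size, and data
below are placeholders rather than the exact script:)

    #!/usr/bin/env python3
    # Rough sketch of the CRUD availability test (names/values illustrative).
    import rados
    import rbd

    POOL = 'pool'          # placeholder pool name
    IMAGE = 'crud-image'   # the same image name is reused for every run

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            rbd.RBD().create(ioctx, IMAGE, 4 * 1024 * 1024)  # 1. create image
            with rbd.Image(ioctx, IMAGE) as image:
                data = b'crud-test'
                image.write(data, 0)                         # 2. write
                assert image.read(0, len(data)) == data      # 3. read back
            rbd.RBD().remove(ioctx, IMAGE)                   # 4. delete image
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()                                   # 5. close connections
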
> > It works great 99% of the time, but there is a small chance that
> > something happens, the script takes too long (1 min) to complete, and it
> > is killed. I don't yet have logging to know at which step this happens,
> > but I will be adding some. Regardless, when the script is killed, the
> > watcher on the rbd sometimes doesn't go away. I use the same RBD name for
> > each test and try to clean up the rbd if it exists before starting the
> > next test, but when the watcher is stuck, the cleanup can't remove it.
> >
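(The per-run cleanup is, roughly, a best-effort removal of the leftover
image; when a stale watch is still registered on the image header, librbd
refuses the removal with EBUSY, which the Python bindings surface as
rbd.ImageBusy. A sketch, again with placeholder names:)

    import rbd

    # Best-effort cleanup of a leftover image from a previous run.
    # A stale watch on the image header makes the remove fail with EBUSY.
    def cleanup_previous(ioctx, name):
        try:
            rbd.RBD().remove(ioctx, name)
        except rbd.ImageNotFound:
            pass   # previous run cleaned up after itself
        except rbd.ImageBusy as exc:
            # This is the stuck-watcher case described above.
            print(f'cannot remove {name}: image still has a watcher ({exc})')
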
> > The only way to clean up the watcher is to restart the primary OSD for
> > the rbd_header object. Blocklisting the client and restarting the
> > container do not free the watcher.
> >
> > When I look at the status of the image, I can see the watcher:
> > # rbd -p pool status crud-image
> > Watchers:
> > watcher=<ipaddr>:0/3587274006 client.1053762394 cookie=140375838755648
> >
> > Looking up the primary OSD:
> > # rbd -p pool info crud-image | grep id
> > id: cf235ae95099cb
> > # ceph osd map pool rbd_header.cf235ae95099cb
> > osdmap e332984 pool 'pool' (1) object 'rbd_header.cf235ae95099cb' -> pg
> > 1.a76f353e (1.53e) -> up ([7,66,176], p7) acting ([7,66,176], p7)
> >
> > Checking watchers on the primary OSD does NOT list rbd_header.cf235ae95099cb:
> > # ceph tell osd.7 dump_watchers
> > [
> >     {
> >         "namespace": "",
> >         "object": "rbd_header.70fa4f9b5c2cf8",
> >         "entity_name": {
> >             "type": "client",
> >             "num": 998139266
> >         },
> >         "cookie": 140354859197312,
> >         "timeout": 30,
> >         "entity_addr_t": {
> >             "type": "v1",
> >             "addr": "<ipaddr>:0",
> >             "nonce": 2665188958
> >         }
> >     }
> > ]
> >
> > Is this a bug somewhere? I expect that if my script is killed, its
> > watcher should die out within a minute. New runs of the script would
> > result in new watcher/client/cookie ids.
>
> Hi Reid,
>
> You might be hitting https://tracker.ceph.com/issues/58120.  It looks
> like the ticket wasn't moved to the appropriate state when the fix got
> merged, so unfortunately the fix isn't available in any of the stable
> releases -- only in 19.1.0 (release candidate for squid).  I have just
> tweaked the ticket and will stage backport PRs shortly.
>
> Thanks,
>
>                 Ilya
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



