Re: RBD Stuck Watcher

Ilya Dryomov <idryomov@xxxxxxxxx> · Thu, 25 Jul 2024 12:11:06 +0200

On Wed, Jul 3, 2024 at 5:45 PM Reid Guyett <reid.guyett@xxxxxxxxx> wrote:
>
> Hi,
>
> I have a small script in a Docker container we use for a type of CRUD test
> to monitor availability. The script uses Python librbd/librados and is
> launched by Telegraf input.exec. It does the following:
>
>    1. Creates an rbd image
>    2. Writes a small amount of data to the rbd
>    3. Reads the data from the rbd
>    4. Deletes the rbd
>    5. Closes connections
>
> It works great 99% of the time but there is a small chance that
> something happens and the script takes too long (1 min) to complete and it
> is killed. I don't have logging to know which step it happens at yet but
> will be adding some. Regardless when the script is killed, sometimes the
> watcher on the rbd isn't going away. I use the same RBD name for each test
> and try to clean up the rbd if it exists prior to starting the next test
> but when the watcher is stuck, it can't.
>
> The only way to clean up the watcher is to restart the primary osd for the
> rbd_header. Blocklist and restarting the container free the watcher.
>
> When I look at the status of the image I can see the watcher.
> # rbd -p pool status crud-image
> Watchers:
> watcher=<ipaddr>:0/3587274006 client.1053762394 cookie=140375838755648
>
> Look up the primary OSD
> # rbd -p pool info crud-image | grep id
> id: cf235ae95099cb
> # ceph osd map pool rbd_header.cf235ae95099cb
> osdmap e332984 pool 'pool' (1) object 'rbd_header.cf235ae95099cb' -> pg
> 1.a76f353e (1.53e) -> up ([7,66,176], p7) acting ([7,66,176], p7)
>
> Checking watchers on primary OSD does NOT list rbd_header.cf235ae95099cb
> # ceph tell osd.7 dump_watchers
> [
>     {
>         "namespace": "",
>         "object": "rbd_header.70fa4f9b5c2cf8",
>         "entity_name": {
>             "type": "client",
>             "num": 998139266
>         },
>         "cookie": 140354859197312,
>         "timeout": 30,
>         "entity_addr_t": {
>             "type": "v1",
>             "addr": "<ipaddr>:0",
>             "nonce": 2665188958
>         }
>     }
> ]
>
> Is this a bug somewhere? I expect that if my script is killed its watcher
> should die out within a minute. New runs of the script would result in new
> watcher/client/cookie ids.

Hi Reid,

You might be hitting https://tracker.ceph.com/issues/58120.  It looks
like the ticket wasn't moved to the appropriate state when the fix got
merged, so unfortunately the fix isn't available in any of the stable
releases -- only in 19.1.0 (release candidate for squid).  I have just
tweaked the ticket and will stage backport PRs shortly.

Thanks,

                Ilya
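
The manual check in the quoted message (rbd status, then ceph osd map,
then dump_watchers on the primary) could be scripted so the monitoring
job flags the mismatch itself. What follows is only a rough sketch, not
part of the fix for the tracker issue. It assumes the command outputs
look exactly like the ones pasted above, which may differ between
releases, and every function name in it is hypothetical. It parses text
that has already been captured and does not call rbd or ceph itself.

```python
import json
import re


def parse_rbd_status_watchers(status_text):
    """Return (address, client_id, cookie) tuples from `rbd status` text.

    Assumes lines of the form shown in the thread:
    watcher=<addr>/<nonce> client.<id> cookie=<cookie>
    """
    pattern = re.compile(r"watcher=(\S+)\s+client\.(\d+)\s+cookie=(\d+)")
    return [(addr, int(client), int(cookie))
            for addr, client, cookie in pattern.findall(status_text)]


def parse_primary_osd(osd_map_text):
    """Return the acting primary OSD id from `ceph osd map` text output."""
    match = re.search(r"acting\s+\(\[[^\]]*\],\s*p(\d+)\)", osd_map_text)
    if match is None:
        raise ValueError("no acting primary found in osd map output")
    return int(match.group(1))


def watchers_on_osd(dump_watchers_json, object_name):
    """Return (client_id, cookie) pairs the OSD reports for object_name."""
    entries = json.loads(dump_watchers_json)
    return [(entry["entity_name"]["num"], entry["cookie"])
            for entry in entries if entry["object"] == object_name]


def unmatched_watchers(status_text, dump_watchers_json, object_name):
    """Return watchers listed by `rbd status` that the OSD dump does not show.

    A non-empty result is the symptom described in the thread: the image
    reports a watcher, but dump_watchers on the primary OSD has no entry
    for that header object.
    """
    known = set(watchers_on_osd(dump_watchers_json, object_name))
    return [(addr, client, cookie)
            for addr, client, cookie in parse_rbd_status_watchers(status_text)
            if (client, cookie) not in known]
```

Fed the outputs from the quoted message, unmatched_watchers() would
return the single client.1053762394 watcher, since osd.7 only lists a
watcher for a different header object.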
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx