RBD Stuck Watcher

Reid Guyett <reid.guyett@xxxxxxxxx> · Wed, 3 Jul 2024 11:45:03 -0400

Hi,

I have a small script in a Docker container we use for a type of CRUD test
to monitor availability. The script uses Python librbd/librados and is
launched by Telegraf input.exec. It does the following:

   1. Creates an rbd image
   2. Writes a small amount of data to the rbd
   3. Reads the data from the rbd
   4. Deletes the rbd
   5. Closes connections
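
For reference, the five steps above look roughly like the sketch below. This is not the actual script: the function name, conffile path, payload, and image size are placeholders, and the pool and image names are the ones used in the commands later in this mail. Only standard python-rados/python-rbd calls are used, with try/finally so the image (and its watch on the header object) is closed even if a step raises.

```python
def crud_cycle(conffile="/etc/ceph/ceph.conf", pool="pool",
               image_name="crud-image", size=4 * 1024 * 1024):
    """Create, write, read, and delete one RBD image; return True on success."""
    # Imported lazily so this module still loads where the Ceph bindings are absent.
    import rados
    import rbd

    payload = b"crud-test"  # placeholder test data
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            rbd_api = rbd.RBD()
            rbd_api.create(ioctx, image_name, size)   # 1. create the image
            image = rbd.Image(ioctx, image_name)
            try:
                image.write(payload, 0)               # 2. write a little data
                data = image.read(0, len(payload))    # 3. read it back
            finally:
                image.close()                         # releases the image handle
            rbd_api.remove(ioctx, image_name)         # 4. delete the image
            return data == payload
        finally:
            ioctx.close()                             # 5. close connections
    finally:
        cluster.shutdown()
```

Note that the finally blocks only run if the interpreter gets a chance to unwind; they do nothing if the process is killed with SIGKILL.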

It works 99% of the time, but occasionally the script takes too long
(1 min) to complete and is killed. I don't have logging yet to know which
step it hangs at, but I will be adding some. Regardless, when the script is
killed, sometimes the watcher on the rbd doesn't go away. I use the same RBD
name for each test and try to clean up the rbd if it exists before starting
the next test, but when the watcher is stuck, the cleanup fails.
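
One mitigation I am considering is an internal deadline shorter than the outer 1 min timeout, so the script can unwind and close the image itself. The sketch below is an assumption-laden illustration, not what the script does today: `Terminated` and `run_with_deadline` are hypothetical names, I am assuming the kill arrives as SIGTERM (I have not confirmed which signal is sent), and it only works on Unix in the main thread. It cannot help against SIGKILL, and Python only runs signal handlers between bytecodes, so a call blocked inside the C library may still delay the exception.

```python
import signal


class Terminated(Exception):
    """Raised in the main thread when SIGTERM or the deadline alarm arrives."""


def _raise_terminated(signum, frame):
    raise Terminated(f"received signal {signum}")


def run_with_deadline(func, seconds):
    """Run func(); raise Terminated on SIGTERM or after `seconds` (via SIGALRM)."""
    old_term = signal.signal(signal.SIGTERM, _raise_terminated)
    old_alrm = signal.signal(signal.SIGALRM, _raise_terminated)
    signal.alarm(seconds)  # whole seconds only
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGTERM, old_term)
        signal.signal(signal.SIGALRM, old_alrm)
```

Because the exception propagates through the normal call stack, any try/finally cleanup inside `func` (closing the image, shutting down the cluster handle) gets a chance to run before the process exits.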

The only way to clean up the watcher is to restart the primary OSD for the
rbd_header object. Blocklisting the client and restarting the container do
not free the watcher.

When I look at the status of the image, I can see the watcher:
# rbd -p pool status crud-image
Watchers:
watcher=<ipaddr>:0/3587274006 client.1053762394 cookie=140375838755648

Looking up the primary OSD:
# rbd -p pool info crud-image | grep id
id: cf235ae95099cb
# ceph osd map pool rbd_header.cf235ae95099cb
osdmap e332984 pool 'pool' (1) object 'rbd_header.cf235ae95099cb' -> pg
1.a76f353e (1.53e) -> up ([7,66,176], p7) acting ([7,66,176], p7)
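
To automate this lookup, the acting primary can be pulled out of that text output. A minimal sketch, assuming the output keeps the `acting ([...], pN)` shape shown above; `acting_primary` is a hypothetical helper name.

```python
import re


def acting_primary(osd_map_output):
    """Return the acting primary OSD id from `ceph osd map` text output, or None."""
    # Collapse whitespace first so terminal or mail line wrapping does not matter.
    text = " ".join(osd_map_output.split())
    match = re.search(r"acting \(\[[\d,]*\], p(-?\d+)\)", text)
    return int(match.group(1)) if match else None


sample = (
    "osdmap e332984 pool 'pool' (1) object 'rbd_header.cf235ae95099cb' -> pg\n"
    "1.a76f353e (1.53e) -> up ([7,66,176], p7) acting ([7,66,176], p7)"
)
print(acting_primary(sample))  # prints 7
```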

Checking watchers on the primary OSD does NOT list rbd_header.cf235ae95099cb:
# ceph tell osd.7 dump_watchers
[
    {
        "namespace": "",
        "object": "rbd_header.70fa4f9b5c2cf8",
        "entity_name": {
            "type": "client",
            "num": 998139266
        },
        "cookie": 140354859197312,
        "timeout": 30,
        "entity_addr_t": {
            "type": "v1",
            "addr": "<ipaddr>:0",
            "nonce": 2665188958
        }
    }
]
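
The same check can be scripted against the JSON above. A small sketch, assuming the output is a JSON list of entries with an "object" key as shown; `watchers_for` is a hypothetical helper name.

```python
import json


def watchers_for(dump_json, object_name):
    """Return the dump_watchers entries whose "object" equals object_name."""
    return [w for w in json.loads(dump_json) if w.get("object") == object_name]


# Trimmed copy of the output above.
sample_dump = """
[
    {
        "namespace": "",
        "object": "rbd_header.70fa4f9b5c2cf8",
        "entity_name": {"type": "client", "num": 998139266},
        "cookie": 140354859197312,
        "timeout": 30
    }
]
"""
print(watchers_for(sample_dump, "rbd_header.cf235ae95099cb"))  # prints []
```

An empty result for the stuck image's header object, while `rbd status` still reports a watcher, is the mismatch described here.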

Is this a bug somewhere? I expect that if my script is killed, its watcher
should time out within a minute. New runs of the script would result in new
watcher/client/cookie ids.

Thanks!

Reid