Re: Dead node (watcher) won't timeout on RBD

Ilya Dryomov <idryomov@xxxxxxxxx> · Sun, 16 Apr 2023 18:04:29 +0200

On Sat, Apr 15, 2023 at 4:58 PM Max Boone <max@xxxxxxxxxx> wrote:
>
>
> After a critical node failure on my lab cluster, which won't come
> back up and is still down, the RBD objects are still being watched
> / mounted according to ceph. I can't shell into the node to rbd unmap
> them as the node is down. I am absolutely certain that nothing is
> using these images and they don't have snapshots either (and this IP
> is not even remotely close to those of the monitors in the
> cluster). I blocked the IP using ceph osd blocklist add but after 30
> minutes, they are still being watched. Them being watched (they are
> RWO ceph-csi volumes) prevents me from re-using them in the cluster.
> As far as I'm aware, ceph should remove the watchers after 30 minutes
> and they've been blocklisted for hours now.

Hi Max,

A couple of general points:

- watch timeout is 30 seconds, not 30 minutes
- watcher IP doesn't have to match that of any of the monitors

> root@node0:~# rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> Watchers:
>         watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
>         size 10 GiB in 2560 objects
>         order 22 (4 MiB objects)
>         snapshot_count: 0
>         id: 4ff5353b865e1
>         block_name_prefix: rbd_data.4ff5353b865e1
>         format: 2
>         features: layering
>         op_features:
>         flags:
>         create_timestamp: Fri Mar 31 14:46:51 2023
>         access_timestamp: Fri Mar 31 14:46:51 2023
>         modify_timestamp: Fri Mar 31 14:46:51 2023
> root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
> watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# ceph osd blocklist ls
> 10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
> listed 1 entries
> root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
> {
>     "osd_client_watch_timeout": "30"
> }
>
> Is it possible to kick a watcher out manually, or is there not much
> I can do here besides shutting down the entire cluster (or OSDs) and
> getting them back up? If it is a bug, I'm happy to help figure out
> its root cause and see if I can help write a fix. Cheers, Max.

You may have hit https://tracker.ceph.com/issues/58120.

Try restarting the OSD that is holding the header object.  To determine
the OSD, run "ceph osd map kubernetes rbd_header.4ff5353b865e1".  The
output should end with something like "acting ([X, Y, Z], pX)", where X,
Y and Z are numbers.  X is the OSD you want to restart.
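
If you would rather not read the number off by eye, here is a minimal
sketch of pulling it out of that line.  It assumes the output ends with
"acting ([X,Y,Z], pX)" as described above; the helper name
primary_osd_from_map is hypothetical, not part of any Ceph tool.

```shell
#!/bin/sh
# Hypothetical helper: print the primary OSD id (the number after "p")
# from one line of "ceph osd map" output.
# Assumes the line ends with: acting ([X,Y,Z], pX)
# Prints nothing if the line does not match that shape.
primary_osd_from_map() {
    printf '%s\n' "$1" |
        sed -n 's/.*acting (\[[0-9, ]*], p\([0-9][0-9]*\)).*/\1/p'
}

# Usage, on a node with admin access to the cluster:
#   osd=$(primary_osd_from_map \
#       "$(ceph osd map kubernetes rbd_header.4ff5353b865e1)")
#   echo "OSD to restart: osd.$osd"
```

How the OSD is actually restarted depends on how the cluster was
deployed (a systemd unit, a container, a pod), so the sketch stops at
printing the id.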

Thanks,

                Ilya