Dead node (watcher) won't time out on RBD

max@xxxxxxxxxx · Sat, 15 Apr 2023 10:31:03 -0000

Hey all,

I recently had a k8s node fail in my homelab. Even though I powered it off (it's done for, so it won't come back up), it still shows up as a watcher in `rbd status`.

```
root@node0:~# rbd status kubernetes/csi-vol-3e7af8ae-ceb6-4c94-8435-2f8dc29b313b
Watchers:
	watcher=10.0.0.103:0/1520114202 client.1697844 cookie=140289402510784
	watcher=10.0.0.103:0/39967552 client.1805496 cookie=140549449430704
root@node0:~# ceph osd blocklist ls
10.0.0.103:0/0 2023-04-15T13:15:39.061379+0200
listed 1 entries
```
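One detail in the output above: the blocklist entry is for `10.0.0.103:0/0`, whereas each watcher is listed with its own nonce (`:0/1520114202` and `:0/39967552`). The following is a minimal sketch, not a confirmed fix, of how one might pull the exact watcher addresses out of `rbd status` output and turn them into per-address `ceph osd blocklist add` commands. The helper name `watcher_addrs` is hypothetical, the sample text is copied from the output above, and the commands are only printed, not executed. Whether blocklisting the exact addresses actually evicts the watchers is an assumption that the output above does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: read `rbd status` output on stdin and print one
# watcher address (ip:port/nonce) per line.
watcher_addrs() {
    sed -n 's/^[[:space:]]*watcher=\([^[:space:]]*\).*/\1/p'
}

# Sample copied from the `rbd status` output above (lines are tab-indented).
sample='Watchers:
	watcher=10.0.0.103:0/1520114202 client.1697844 cookie=140289402510784
	watcher=10.0.0.103:0/39967552 client.1805496 cookie=140549449430704'

# Print (do not run) one blocklist command per watcher address, so the
# commands can be reviewed before anything touches the cluster.
printf '%s\n' "$sample" | watcher_addrs | while read -r addr; do
    echo "ceph osd blocklist add $addr"
done
```

On a live cluster, the sample could be replaced by piping `rbd status kubernetes/csi-vol-3e7af8ae-ceb6-4c94-8435-2f8dc29b313b` into `watcher_addrs`.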

Even though the node is down and I have blocklisted it multiple times, for hours, the watchers won't disappear. As a result, ceph-csi-rbd claims the image is already mounted (binding manually works fine and unbinds cleanly, but I can't unbind from a node that doesn't exist anymore).

Is there any way to force-kick an RBD client / watcher from Ceph (e.g. by switching the mgr / mon), or to see why it is not timing out?

I found some old mails and issues (related to Rook, which I don't use) about a parameter, `osd_client_watch_timeout`, but I can't work out how it relates to RBD images.

Cheers,
Max.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx