On Sat, Apr 15, 2023 at 4:58 PM Max Boone <max@xxxxxxxxxx> wrote:
>
> After a critical node failure on my lab cluster, which won't come
> back up and is still down, the RBD objects are still being watched /
> mounted according to Ceph. I can't shell into the node to rbd unmap
> them as the node is down. I am absolutely certain that nothing is
> using these images and they don't have snapshots either (and this IP
> is not even remotely close to those of the monitors in the cluster).
> I blocked the IP using "ceph osd blocklist add", but after 30 minutes
> they are still being watched. Them being watched (they are RWO
> ceph-csi volumes) prevents me from re-using them in the cluster. As
> far as I'm aware, Ceph should remove the watchers after 30 minutes,
> and they have been blocklisted for hours now.

Hi Max,

A couple of general points:

- watch timeout is 30 seconds, not 30 minutes
- watcher IP doesn't have to match that of any of the monitors

> root@node0:~# rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> Watchers:
>         watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
>         size 10 GiB in 2560 objects
>         order 22 (4 MiB objects)
>         snapshot_count: 0
>         id: 4ff5353b865e1
>         block_name_prefix: rbd_data.4ff5353b865e1
>         format: 2
>         features: layering
>         op_features:
>         flags:
>         create_timestamp: Fri Mar 31 14:46:51 2023
>         access_timestamp: Fri Mar 31 14:46:51 2023
>         modify_timestamp: Fri Mar 31 14:46:51 2023
> root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
> watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# ceph osd blocklist ls
> 10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
> listed 1 entries
> root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
> {
>     "osd_client_watch_timeout": "30"
> }
>
> Is it possible to kick a watcher out manually, or is there not much
> I can do here besides shutting down the entire cluster (or OSDs) and
> getting them back up? If it is a bug, I'm happy to help figure out
> its root cause and see if I can help write a fix.
>
> Cheers,
> Max.

You may have hit https://tracker.ceph.com/issues/58120. Try restarting
the OSD that is holding the header object. To determine the OSD, run
"ceph osd map kubernetes rbd_header.4ff5353b865e1". The output should
end with something like "acting ([X, Y, Z], pX)", where X, Y and Z are
numbers. X is the OSD you want to restart.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
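
A minimal command sketch of the restart procedure Ilya describes above,
assuming a systemd-managed (non-cephadm) deployment and reusing the
pool, object and image names from this thread; "osd.3" is a placeholder
for whatever primary OSD "ceph osd map" actually reports:

    # Locate the acting set for the RBD header object; the trailing
    # "pX" in the output names the primary OSD.
    ceph osd map kubernetes rbd_header.4ff5353b865e1
    # e.g. "... acting ([3, 7, 1], p3)" -> primary is osd.3 (placeholder id)

    # Restart that OSD on the node hosting it (systemd deployments):
    systemctl restart ceph-osd@3
    # or, on a cephadm-managed cluster:
    ceph orch daemon restart osd.3

    # Confirm the stale watch is gone before re-attaching the volume:
    rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
    rados -p kubernetes listwatchers rbd_header.4ff5353b865e1

The restart drops the OSD's in-memory watch state; live clients
re-establish their watches when they reconnect, while the dead,
blocklisted client cannot, so its stale watch on the header object
should disappear.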