Many pgs inactive after node failure

I have a 3-node Ceph cluster in my home lab. One of the pools spans 3
HDDs, one on each node, and has size 2, min_size 1. One of my nodes is
currently down, and I have 160 PGs in 'unknown' state. The other 2
hosts are up and the cluster has quorum.
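
For reference, the pool's replication settings can be double-checked
with something like the following (the pool name is a placeholder):

ceph osd pool get <pool-name> size
ceph osd pool get <pool-name> min_size
ceph osd pool get <pool-name> crush_rule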

Example `ceph health detail` output:
pg 9.0 is stuck inactive for 25h, current state unknown, last acting []

I have 3 questions:

Why would the PGs be in an unknown state?
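
If it helps, I can gather more detail with something along these lines
(using the example pg from above; I gather 'pg query' may not return
anything useful while last acting is empty):

ceph pg dump_stuck inactive
ceph pg map 9.0
ceph pg 9.0 query
ceph osd tree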

I would like to recover the cluster without recovering the failed
node, primarily so that I know I can. Is that possible?
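
If so, is it just a matter of marking the down OSDs on that host out
and letting the data backfill onto the remaining disks, e.g. something
like the following (the OSD id is a placeholder):

ceph osd out <osd-id>
# or, if the OSD were never coming back:
ceph osd purge <osd-id> --yes-i-really-mean-it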

The host's boot NVMe has failed, so I will most likely rebuild it. I'm
running Rook, and my plan is to delete the old node and create a new
one with the same name. AFAIK the OSDs themselves are fine. When Rook
rediscovers the OSDs, will it add them back with their data intact? If
not, is there any way I can make it do so?
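
For what it's worth, I'm assuming that once the node is reinstalled (or
from a live environment) I can at least confirm the OSD data is still
on the disks with something like the following (the device path is
just an example):

ceph-volume lvm list
ceph-bluestore-tool show-label --dev /dev/sdX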

Thanks!
-- 
Matthew Booth
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


