On Sat, Nov 4, 2023, 6:44 AM Matthew Booth <mbooth@xxxxxxxxxx> wrote:
> I have a 3 node ceph cluster in my home lab. One of the pools spans 3
> hdds, one on each node, and has size 2, min size 1. One of my nodes is
> currently down, and I have 160 pgs in 'unknown' state. The other 2
> hosts are up and the cluster has quorum.
>
> Example `ceph health detail` output:
> pg 9.0 is stuck inactive for 25h, current state unknown, last acting []
>
> I have 3 questions:
>
> Why would the pgs be in an unknown state?

No quick answer to this, unfortunately. Try `ceph pg map 9.0` and look at it
alongside the output of `ceph osd tree` (a rough set of diagnostic commands is
sketched in the P.S. below). Were you tinkering with device classes, CRUSH
rules, or anything along those lines? Did the OSD that failed get marked out?
Do you have an active mgr? Does `ceph health detail` report anything else as a
problem?

> I would like to recover the cluster without recovering the failed
> node, primarily so that I know I can. Is that possible?
>
> The boot nvme of the host has failed, so I will most likely rebuild
> it. I'm running rook, and I will most likely delete the old node and
> create a new one with the same name. AFAIK, the OSDs are fine. When
> rook rediscovers the OSDs, will it add them back with data intact? If
> not, is there any way I can make it so it will?

Assuming you used the standard tools/playbooks, pretty much everything just
shoves each Ceph OSD onto an LVM logical volume. As long as you leave those
LVM volumes alone, you can tell Ceph to scan them for metadata and "activate"
them again (in Ceph parlance), as another user mentioned. There's a rough
sketch of that in the P.P.S. below.

Happy homelabbing!

> Thanks!
> --
> Matthew Booth
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
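
P.S. In case it helps, here is roughly the set of read-only commands I'd run
first to work out why the PGs are unknown. The pg id 9.0 is just the example
from your `ceph health detail` output, and <pool-name> is a placeholder for
whichever pool the stuck PGs belong to; adjust both as needed.

    # Is there an active mgr? Unknown PGs are often just a mgr problem,
    # since PG state is reported through the mgr.
    ceph -s
    ceph mgr stat

    # Where does CRUSH think the PG should live, and which OSDs are up/in?
    ceph pg map 9.0
    ceph osd tree

    # Check the pool's CRUSH rule and device classes, in case the rule can
    # no longer be satisfied by the remaining hosts.
    ceph osd pool get <pool-name> crush_rule
    ceph osd crush rule dump

None of these change anything, so they are safe to run on a degraded cluster.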
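
P.P.S. On the reactivation side, this is what ceph-volume does on a plain
(non-Rook) host; with Rook the OSD prepare job should run the equivalent scan
for you when it finds existing OSD metadata on the disks, so treat this only
as a sketch of what happens under the hood.

    # List any OSDs whose data ceph-volume can find on LVM
    ceph-volume lvm list

    # Recreate the mounts and systemd units for all discovered OSDs
    ceph-volume lvm activate --all

As long as the LVs themselves weren't wiped when the boot drive died, the
OSDs should come back with their data intact.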