Re: CEPH complete cluster failure: unknown PGS

Eugen Block <eblock@xxxxxx> · Thu, 05 Oct 2023 06:33:58 +0000

Hi,

were you able to recover your cluster or is this still an issue?
What exactly do you mean by this?

> It generally fails to recover in the middle and starts from scratch.

Are OSDs "flapping" or are there other issues as well? Please provide
more details on what exactly happens.
There are a couple of troubleshooting threads on this list which could
help. The first thing I would try is to set the "nodown" flag ('ceph osd
set nodown') to prevent OSDs from being marked down; this has been
quite useful in the past. Are the OSD nodes running out of resources?
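
For reference, a minimal sketch of how the flag could be set, checked and
cleared again. It assumes a node with the ceph CLI and an admin keyring;
the function names and the CEPH override variable are only illustrative
and not part of Ceph itself. Setting CEPH="echo ceph" gives a dry run that
prints the commands instead of executing them.

```shell
#!/bin/sh
# Illustrative wrappers (hypothetical names) around standard ceph commands.
# CEPH may be overridden, e.g. CEPH="echo ceph", to print instead of run.
CEPH="${CEPH:-ceph}"

# Keep the monitors from marking OSDs down, so that flapping OSDs
# do not trigger peering over and over again.
set_nodown() {
    $CEPH osd set nodown
}

# Clear the flag once the cluster has settled; while it is set,
# OSDs that are really dead are not marked down either.
unset_nodown() {
    $CEPH osd unset nodown
}

# Print the cluster summary followed by the detailed health report.
show_state() {
    $CEPH -s
    $CEPH health detail
}
```

Typical use would be set_nodown, then show_state repeatedly while the PGs
peer, and unset_nodown once things look stable. The flag only hides the
flapping, so the cause (slow disks, memory, network) still has to be found.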

Quoting v1tnam@xxxxxxxxx:

I have an 8-node cluster with old hardware. A week ago 4 nodes went
down and the CEPH cluster went nuts.
All PGs became unknown and the monitors took too long to get in sync.
So I reduced the number of mons to one and mgrs to one as well.

Now the recovery starts with 100% unknown PGs and then the PGs start to
move to inactive. It generally fails to recover in the middle and
starts from scratch.

It's old hardware and the OSDs have lots of slow ops and probably
a number of bad sectors as well.

Any suggestions on how to tackle this? It's a Nautilus cluster and
pretty old (8-year-old hardware).

Thanks
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
