Tonight an old Ceph cluster we run suffered a hardware failure that
resulted in the loss of Ceph journal SSDs on 7 nodes out of 36. Overview
of this old setup:
- Super-old Ceph Dumpling v0.67
- 3x replication for RBD w/ 3 failure domains in replication hierarchy
- OSDs on XFS on spinning disks, with journals on SSD
In total we lost 7 SSDs hosting the journals for 21 OSDs (3 journals
per SSD). The lost nodes span all three failure domains, which makes me
nervous that some Placement Groups in the pool have lost all three
copies. Given how Ceph stripes RBD data across Placement Groups, even a
few fully lost PGs could mean I've lost all the RBD volumes in this
pool.
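For context, this is how I've been sizing up the damage so far
(standard status commands, nothing exotic):

    ceph health detail             # lists degraded/incomplete/down PGs
    ceph pg dump_stuck inactive    # PGs that cannot currently serve I/O
    ceph pg dump_stuck stale       # PGs whose OSDs all stopped reporting
    ceph osd tree                  # confirm which OSDs are down/out

Anything showing up as incomplete or stale across all three replicas is
presumably the data I should be worried about.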
The obvious approach is to attempt to bring the OSDs back online (for
at least one failure domain) so that there is at least one complete
copy of the data, then rebuild everything else from it. The problem is
that the journals went down with the SSDs.
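My working assumption, based on the little I've found, is that recovery
means giving each OSD a fresh journal and hoping the filestore is
consistent. Roughly (a sketch only, untested here; osd.21 and /dev/sdX1
are placeholders for an affected OSD and a replacement partition):

    service ceph stop osd.21                             # Dumpling-era sysvinit
    rm /var/lib/ceph/osd/ceph-21/journal                 # drop symlink to the dead SSD
    ln -s /dev/sdX1 /var/lib/ceph/osd/ceph-21/journal    # point at the new device
    ceph-osd -i 21 --mkjournal                           # initialize a fresh journal
    service ceph start osd.21

What I can't answer is what happens to transactions that were committed
to the journal but never flushed to the filestore; --mkjournal
presumably just discards them.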
I haven't found much published about recovering OSDs after a lost
journal beyond this:
https://ceph.io/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/
and it doesn't say whether the data is consistent afterwards. I seem to
recall that Inktank used to handle this situation and may have had a
working procedure. At this point, I'll take any constructive advice.
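If any OSDs do come back this way, my plan would be to deep-scrub
everything they host and look for inconsistencies before trusting the
data, along these lines (pgid 3.1f is just an example):

    ceph osd deep-scrub 21                  # deep-scrub all PGs on a recovered OSD
    ceph pg deep-scrub 3.1f                 # or target an individual suspect PG
    ceph health detail | grep inconsistent  # surface any replica mismatches

Though scrubbing only compares replicas against each other, so if all
three copies went through the same journal-loss procedure I'm not sure
how much it proves.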
Thank you in advance,
Mike