Tonight an old Ceph cluster we run suffered a hardware failure that
resulted in the loss of Ceph journal SSDs on 7 nodes out of 36. Overview
of this old setup:
- Super-old Ceph Dumpling v0.67
- 3x replication for RBD w/ 3 failure domains in replication hierarchy
- OSDs on XFS on spinning disks, with journals on SSD
In total we lost 7 SSDs hosting the journals for 21 OSDs (3 journals
per SSD). The lost nodes span all three failure domains, which makes me
nervous that some Placement Groups in the pool have lost all three
copies. Given how Ceph stripes RBD data across Placement Groups, even a
few fully lost PGs could mean I've lost all the RBD volumes in this
pool.
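For context, this is how I've been sizing up the damage so far
(standard status commands, nothing exotic):

    ceph health detail             # lists degraded/incomplete/down PGs
    ceph pg dump_stuck inactive    # PGs that cannot currently serve I/O
    ceph pg dump_stuck stale       # PGs whose OSDs all stopped reporting
    ceph osd tree                  # confirm which OSDs are down/out

Anything showing up as incomplete or stale across all three replicas is
presumably the data I should be worried about.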
The obvious approach is to attempt to bring the OSDs back online (for
at least one failure domain) so that there is at least one complete
copy of the data, then rebuild everything else from it. The problem is
that the journals went down with the SSDs.
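My working assumption, based on the little I've found, is that recovery
means giving each OSD a fresh journal and hoping the filestore is
consistent. Roughly (a sketch only, untested here; osd.21 and /dev/sdX1
are placeholders for an affected OSD and a replacement partition):

    service ceph stop osd.21                             # Dumpling-era sysvinit
    rm /var/lib/ceph/osd/ceph-21/journal                 # drop symlink to the dead SSD
    ln -s /dev/sdX1 /var/lib/ceph/osd/ceph-21/journal    # point at the new device
    ceph-osd -i 21 --mkjournal                           # initialize a fresh journal
    service ceph start osd.21

What I can't answer is what happens to transactions that were committed
to the journal but never flushed to the filestore; --mkjournal
presumably just discards them.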
I haven't found much published about recovering OSDs after a lost
journal beyond this:
https://ceph.io/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/
and it doesn't say whether the data is consistent afterwards. I seem to
recall that Inktank used to handle this situation and may have had a
working procedure. At this point, I'll take any constructive advice.
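If any OSDs do come back this way, my plan would be to deep-scrub
everything they host and look for inconsistencies before trusting the
data, along these lines (pgid 3.1f is just an example):

    ceph osd deep-scrub 21                  # deep-scrub all PGs on a recovered OSD
    ceph pg deep-scrub 3.1f                 # or target an individual suspect PG
    ceph health detail | grep inconsistent  # surface any replica mismatches

Though scrubbing only compares replicas against each other, so if all
three copies went through the same journal-loss procedure I'm not sure
how much it proves.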
Thank you in advance,
Mike