I've inherited a Ceph Octopus cluster that appears to need urgent maintenance before data loss sets in. I'm the person with the most Ceph experience on hand, and that's not saying much: most of these ops and repair tasks are new to me.

Ceph health output looks like this:

HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded (41.359%), 128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in time; 128 pgs not scrubbed in time

'ceph -s' output: https://termbin.com/i06u
The CRUSH rule 'cephfs.media' is here: https://termbin.com/2klmq

So it seems all PGs are in a warning state for the main pool, which is erasure-coded and 11 TiB across 4 OSDs, of which around 6.4 TiB is used. The Ceph daemons themselves seem happy: they're stable and have quorum, and I can access the web dashboard fine too. The block devices are of different sizes and types (2 large spinners of differing sizes, and 2 identical SSDs).

I would welcome any pointers on the steps to bring this back to full health. If the PGs are undersized, can I simply add another block device/OSD? Or will adjusting the configuration somewhere get it to rebalance successfully? (The rebalance jobs have been stuck at 0% for weeks.)

Thank you for your time reading this message.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
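P.S. For what it's worth, the 41.359% in the health line is just the ratio of degraded object copies to total object copies, taken straight from the two numbers Ceph prints; a quick sanity check of the arithmetic (nothing Ceph-specific, just the figures from the HEALTH_WARN line above):

```python
# Numbers copied from the HEALTH_WARN line above:
# "3640401/8801868 objects degraded (41.359%)"
degraded = 3_640_401   # degraded object copies
total = 8_801_868      # total object copies across the cluster

pct = degraded / total * 100
print(f"{pct:.3f}%")   # prints 41.359%, matching what Ceph reports
```

So it really is ~41% of all object copies that are short of their target redundancy, not just a handful of PGs.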