I've inherited a Ceph Octopus cluster that appears to need urgent maintenance before data loss sets in. I'm the person with the most Ceph experience on hand, and that's not saying much: most of these ops and repair tasks are new to me.

Ceph health output looks like this:

HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded (41.359%), 128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in time; 128 pgs not scrubbed in time

'ceph -s' output: https://termbin.com/i06u
The CRUSH rule 'cephfs.media' is here: https://termbin.com/2klmq

So it seems all PGs are in a warning state for the main pool, which is erasure-coded and 11 TiB across 4 OSDs, of which around 6.4 TiB is used. The Ceph daemons themselves seem happy: they're stable and have quorum, and I can access the web dashboard fine too. The block devices are of different sizes and types (2 large spinners of differing sizes, and 2 identical SSDs).

I would welcome any pointers on the steps to bring this back to full health. If the PGs are undersized, can I simply add another block device/OSD? Or will adjusting the configuration somewhere get it to rebalance successfully? (The rebalance jobs have been stuck at 0% for weeks.)

Thank you for your time reading this message.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
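P.S. For what it's worth, the 41.359% in the health line is just the ratio of degraded object copies to total object copies, taken straight from the two numbers Ceph prints; a quick sanity check of the arithmetic (nothing Ceph-specific, just the figures from the HEALTH_WARN line above):

```python
# Numbers copied from the HEALTH_WARN line above:
# "3640401/8801868 objects degraded (41.359%)"
degraded = 3_640_401   # degraded object copies
total = 8_801_868      # total object copies across the cluster

pct = degraded / total * 100
print(f"{pct:.3f}%")   # prints 41.359%, matching what Ceph reports
```

So it really is ~41% of all object copies that are short of their target redundancy, not just a handful of PGs.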