Quoting by morphin (morphinwithyou@xxxxxxxxx):

> After 72 hours I believe we may hit a bug. Any help would be greatly
> appreciated.

Is it feasible for you to stop all client IO to the Ceph cluster? At
least until it stabilizes again. "ceph osd pause" would do the trick
("ceph osd unpause" would unset it).

What kind of workload are you running on the cluster? What does your
crush map look like (ceph osd getcrushmap -o /tmp/crush_raw; crushtool
-d /tmp/crush_raw -o /tmp/crush_edit)?

I have seen a (test) Ceph cluster "healing" itself to the point where
there was nothing left to recover on. In *that* case the disks were
overbooked (multiple OSDs per physical disk) ... The flags you set
(noout, nodown, nobackfill, norecover, noscrub, etc.) helped to get it
to recover again.

I would try to get all OSDs online again (and manually keep them up /
restart them, because you have set nodown). Does the cluster recover
at all?

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
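
A minimal sketch of the commands this advice translates to, assuming
systemd-managed OSDs; the OSD id (0) is only a placeholder, and the
order in which the flags are cleared is one reasonable choice, not a
prescription:

    # Check overall state and which OSDs are down
    ceph -s
    ceph osd tree

    # Restart a crashed OSD by hand (id 0 as an example); with nodown
    # set, the monitors will not mark failed OSDs down for you
    systemctl restart ceph-osd@0

    # Once all OSDs are up and stable, clear the flags one at a time
    # and watch "ceph -s" in between
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset nodown
    ceph osd unset noout

    # Resume client IO only once recovery is making progress
    ceph osd unpause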