Quoting by morphin (morphinwithyou@xxxxxxxxx):

> After 72 hours I believe we may hit a bug. Any help would be greatly
> appreciated.

Is it feasible for you to stop all client IO to the Ceph cluster? At
least until it stabilizes again. "ceph osd pause" would do the trick
("ceph osd unpause" would unset it).

What kind of workload are you running on the cluster? What does your
crush map look like (ceph osd getcrushmap -o /tmp/crush_raw; crushtool
-d /tmp/crush_raw -o /tmp/crush_edit)?

I have seen a (test) Ceph cluster "healing" itself to the point where
there was nothing left to recover on. In *that* case the disks were
overbooked (multiple OSDs per physical disk) ... The flags you set
(noout, nodown, nobackfill, norecover, noscrub, etc.) helped to get it
to recover again.

I would try to get all OSDs online again (and manually keep them up /
restart them, because you have set nodown). Does the cluster recover
at all?

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
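
A minimal sketch of the commands this advice translates to, assuming
systemd-managed OSDs; the OSD id (0) is only a placeholder, and the
order in which the flags are cleared is one reasonable choice, not a
prescription:

    # Check overall state and which OSDs are down
    ceph -s
    ceph osd tree

    # Restart a crashed OSD by hand (id 0 as an example); with nodown
    # set, the monitors will not mark failed OSDs down for you
    systemctl restart ceph-osd@0

    # Once all OSDs are up and stable, clear the flags one at a time
    # and watch "ceph -s" in between
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset nodown
    ceph osd unset noout

    # Resume client IO only once recovery is making progress
    ceph osd unpause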