Re: [ceph-users] Mimic cluster is offline and not healing


There should be no client I/O right now; all of my VMs are down. There is only a single pool.

Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/

The cluster does not recover. After starting the OSDs with the specified
flags, the OSD up count drops from 168 to 50 within 24 hours.
Stefan Kooman <stefan@xxxxxx> wrote on Thu, 27 Sep 2018 at 16:10:
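For reference, the recovery flags discussed in this thread are set and cleared with `ceph osd set` / `ceph osd unset`. A sketch of that sequence (assumes admin access to a live cluster; the exact flag list is taken from the messages below):

```shell
# Freeze the cluster while the down OSDs are brought back up
# (flag names as mentioned in this thread).
for flag in noout nodown nobackfill norecover noscrub nodeep-scrub; do
    ceph osd set "$flag"
done

# Once all OSDs are up and stable, unset the flags one at a time so
# recovery and scrubbing resume gradually rather than all at once.
for flag in noscrub nodeep-scrub nobackfill norecover nodown noout; do
    ceph osd unset "$flag"
done
```

Note that while `nodown` is set, OSDs that actually die are still shown as up, so they have to be watched and restarted manually.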
>
> Quoting by morphin (morphinwithyou@xxxxxxxxx):
> > After 72 hours I believe we may hit a bug. Any help would be greatly
> > appreciated.
>
> Is it feasible for you to stop all client IO to the Ceph cluster? At
> least until it stabilizes again. "ceph osd pause" would do the trick
> (ceph osd unpause would unset it).
>
> What kind of workload are you running on the cluster? What does your
> crush map look like (ceph osd getcrushmap -o /tmp/crush_raw;
> crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
>
> I have seen a (test) Ceph cluster "heal" itself to the point where there
> was nothing left to recover. In *that* case the disks were overbooked
> (multiple OSDs per physical disk) ... The flags you set (noout, nodown,
> nobackfill, norecover, noscrub, etc.) helped it recover again. I would
> try to get all OSDs online again (and manually keep them up / restart
> them, because you have set nodown).
>
> Does the cluster recover at all?
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
> | GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx



