I should not have any client I/O right now; all of my VMs are down. There is
only a single pool. Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/

The cluster does not recover. After starting the OSDs with the specified
flags, the OSD up count drops from 168 to 50 within 24 hours.

Stefan Kooman <stefan@xxxxxx> wrote on Thu, 27 Sep 2018 at 16:10:
>
> Quoting by morphin (morphinwithyou@xxxxxxxxx):
> > After 72 hours I believe we may have hit a bug. Any help would be
> > greatly appreciated.
>
> Is it feasible for you to stop all client I/O to the Ceph cluster? At
> least until it stabilizes again. "ceph osd pause" would do the trick
> ("ceph osd unpause" would unset it).
>
> What kind of workload are you running on the cluster? What does your
> crush map look like (ceph osd getcrushmap -o /tmp/crush_raw;
> crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
>
> I have seen a (test) Ceph cluster "healing" itself to the point where
> there was nothing left to recover. In *that* case the disks were
> overbooked (multiple OSDs per physical disk). The flags you set (noout,
> nodown, nobackfill, norecover, noscrub, etc.) helped it to recover
> again. I would try to get all OSDs online again (and manually keep them
> up / restart them, because you have set nodown).
>
> Does the cluster recover at all?
>
> Gr. Stefan
>
> --
> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
> | GPG: 0xD14839C6 +31 318 648 688 / info@xxxxxx
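For reference, the commands discussed above, collected into one rough
sequence. This is only a sketch: the flag names are the ones mentioned in
the thread, and the systemd unit and OSD id in the restart example are
assumptions for illustration, not taken from this cluster.

  # Stop all client I/O until the cluster stabilizes, release it later.
  ceph osd pause          # sets the pauserd/pausewr flags
  ceph osd unpause        # clears them again

  # Dump and decompile the crush map for inspection.
  ceph osd getcrushmap -o /tmp/crush_raw
  crushtool -d /tmp/crush_raw -o /tmp/crush_edit

  # Flags mentioned in the thread that hold back recovery and marking
  # while OSDs are brought back up (unset them once everything is online).
  ceph osd set noout
  ceph osd set nodown
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd set noscrub

  # With nodown set, dead OSDs are not marked down automatically, so
  # restart them by hand and watch the up count (systemd deployment and
  # OSD id 12 are assumed here).
  systemctl restart ceph-osd@12
  ceph osd stat           # reports "<N> osds: <up> up, <in> in"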