Hey, don't lose hope. I just went through two 3-5 day outages after a Mimic upgrade with no data loss. I'd recommend looking through the thread about it to see how close it is to your issue; from my point of view there seem to be some similarities:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029649.html

At a similar point of desperation with my cluster I would shut down all Ceph processes and bring them back up in order. Doing this got my cluster almost healthy a few times, until it fell over again due to mon issues, so solving any mon issues is the first priority. It also seems like you may benefit from setting mon_osd_cache_size to a very large number if you have enough memory on your mon servers. I'll hop on IRC today.
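For what it's worth, a minimal sketch of that change on the mon hosts (the number is only illustrative; pick it to fit the RAM you actually have, since every cached OSD map costs memory):

    [mon]
    # cache far more OSD maps in the mon's memory (the default is only a handful),
    # so the mons stop re-reading maps from store.db for every request during recovery
    mon osd cache size = 100000

The mons pick the new value up after a restart, done one at a time.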
Kevin

On 09/25/2018 05:53 PM, by morphin wrote:

After trying so many things, with a lot of help on IRC, my pool health is still in ERROR and I think I can't recover from this. https://paste.ubuntu.com/p/HbsFnfkYDT/

In the end, 2 of the 3 mons crashed and started at the same time, and the pool went offline. Recovery is taking more than 12 hours and it is way too slow; somehow recovery does not seem to be working. If I can reach my data I can re-create the pool easily. If I run the ceph-objectstore-tool procedure to regenerate the mon store.db, can I access the RBD pool again?
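For reference, "regenerating the mon store.db" usually means the documented procedure for rebuilding the store from the copies of the cluster maps that the OSDs hold. A rough sketch, with illustrative paths, run while the OSDs on each host are stopped:

    # on each OSD host, scrape the maps from every OSD data dir into a scratch store
    ms=/root/mon-store
    mkdir -p $ms
    for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path "$osd" --op update-mon-db --mon-store-path "$ms"
    done

    # once the scratch stores from all OSD hosts have been collected, rebuild the mon DB
    # (the keyring must carry the mon. and client.admin keys)
    ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

The rebuilt store.db then replaces the one in the mon's data directory before that mon is started again.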
On Tue, 25 Sep 2018 at 20:03, by morphin <morphinwithyou@xxxxxxxxx> wrote:

Hi,

The cluster is still down :( Up to now we have managed to stabilize the OSDs: 118 of the 160 OSDs are stable and the cluster is still in the process of settling. Thanks to Be-El in the Ceph IRC channel, who helped a lot in getting the flapping OSDs stable.

What we have learned so far is that the cause was the sudden death of 2 of the 3 monitor servers, and that if they do not come back one by one (each only after the previous one has joined the cluster), this can happen: the cluster becomes unhealthy and it can take countless hours to come back. Right now here is our status:

ceph -s: https://paste.ubuntu.com/p/6DbgqnGS7t/
health detail: https://paste.ubuntu.com/p/w4gccnqZjR/

Since the OSD disks are NL-SAS it can take up to 24 hours even for an online cluster. What is more, we have been told that we would be extremely lucky if all the data is rescued. Most unhappily, our strategy is just to sit and wait :(. As soon as the peering and activating count drops to 300-500 PGs we will restart the stopped OSDs one by one, waiting for the cluster to settle down after each OSD. The amount of data stored on the OSDs is 33 TB. Our main concern is to export our RBD pool data to a backup space; then we will start again with a clean pool. I would like an expert to check our analysis. Any help or advice would be greatly appreciated.

On Tue, 25 Sep 2018 at 15:08, by morphin <morphinwithyou@xxxxxxxxx> wrote:

Reducing the recovery parameter values did not change much. There are still a lot of OSDs marked down. I don't know what I need to do after this point.

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1

ceph -s
  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            42 osds down
            1 host (6 osds) down
            61/8948582 objects unfound (0.001%)
            Reduced data availability: 3837 pgs inactive, 1822 pgs down, 1900 pgs peering, 6 pgs stale
            Possible data damage: 18 pgs recovery_unfound
            Degraded data redundancy: 457246/17897164 objects degraded (2.555%), 213 pgs degraded, 209 pgs undersized
            2554 slow requests are blocked > 32 sec
            3273 slow ops, oldest one blocked for 1453 sec, daemons [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]... have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3, SRV-SEKUARK4
    osd: 168 osds: 118 up, 160 in

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.95 M objects, 17 TiB
    usage:   33 TiB used, 553 TiB / 586 TiB avail
    pgs:     93.677% pgs not active
             457246/17897164 objects degraded (2.555%)
             61/8948582 objects unfound (0.001%)
             1676 down
             1372 peering
             528  stale+peering
             164  active+undersized+degraded
             145  stale+down
             73   activating
             40   active+clean
             29   stale+activating
             17   active+recovery_unfound+undersized+degraded
             16   stale+active+clean
             16   stale+active+undersized+degraded
             9    activating+undersized+degraded
             3    active+recovery_wait+degraded
             2    activating+undersized
             2    activating+degraded
             1    creating+down
             1    stale+active+recovery_unfound+undersized+degraded
             1    stale+active+clean+scrubbing+deep
             1    stale+active+recovery_wait+degraded

ceph -w: https://paste.ubuntu.com/p/WZ2YqzS86S/
ceph health detail: https://paste.ubuntu.com/p/8w7Jpms8fj/

On Tue, 25 Sep 2018 at 14:32, by morphin <morphinwithyou@xxxxxxxxx> wrote:

The config didn't work, because increasing the numbers only led to more OSD drops.

bhfs -s
  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            norebalance,norecover flag(s) set
            1 osds down
            17/8839434 objects unfound (0.000%)
            Reduced data availability: 3578 pgs inactive, 861 pgs down, 1928 pgs peering, 11 pgs stale
            Degraded data redundancy: 44853/17678868 objects degraded (0.254%), 221 pgs degraded, 20 pgs undersized
            610 slow requests are blocked > 32 sec
            3996 stuck requests are blocked > 4096 sec
            6076 slow ops, oldest one blocked for 4129 sec, daemons [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]... have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
    osd: 168 osds: 128 up, 129 in; 2 remapped pgs
         flags norebalance,norecover

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.84 M objects, 17 TiB
    usage:   26 TiB used, 450 TiB / 477 TiB avail
    pgs:     0.024% pgs unknown
             89.160% pgs not active
             44853/17678868 objects degraded (0.254%)
             17/8839434 objects unfound (0.000%)
             1612 peering
             720  down
             583  activating
             319  stale+peering
             255  active+clean
             157  stale+activating
             108  stale+down
             95   activating+degraded
             84   stale+active+clean
             50   active+recovery_wait+degraded
             29   creating+down
             23   stale+activating+degraded
             18   stale+active+recovery_wait+degraded
             14   active+undersized+degraded
             12   active+recovering+degraded
             4    stale+creating+down
             3    stale+active+recovering+degraded
             3    stale+active+undersized+degraded
             2    stale
             1    active+recovery_wait+undersized+degraded
             1    active+clean+scrubbing+deep
             1    unknown
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+degraded

I guess the OSD down/drop issue increases the recovery time, so I decided to try decreasing the recovery parameters to put less load on the cluster. I have NVMe and SAS disks, the servers are powerful enough, and the network is 4x10Gb. I don't think my cluster is in bad shape, because I have datacenter redundancy (14 servers + 14 servers). The 7 crashed servers are all in datacenter A, and it took only a few minutes to bring them back online. Also, 2 of them are monitors, so cluster I/O should have been suspended and there should be little data difference. On the other hand, I don't understand the burden of this recovery. I have been through many recoveries, but none of them stopped my cluster from working; this recovery burden is so high that it hasn't stopped for hours.

I wish I could just decrease the recovery speed and continue to serve my VMs. Is the recovery load somehow different in Mimic? Luminous was pretty fine indeed.
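For reference, the usual runtime knobs for slowing recovery down in favour of client I/O look roughly like this (values are only illustrative, and injectargs settings do not survive an OSD restart):

    # fewer concurrent backfill/recovery ops per OSD, plus a small sleep between them
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'

    # keep client ops above recovery ops (priorities go up to 63, higher wins)
    ceph tell osd.* injectargs '--osd-client-op-priority 63 --osd-recovery-op-priority 1'

Note that the [osd] snippet quoted earlier has those two priorities the other way around, i.e. recovery ahead of client traffic.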
On Tue, 25 Sep 2018 at 13:57, by morphin <morphinwithyou@xxxxxxxxx> wrote:

Thank you for the answer. What do you think of this conf to speed up the recovery?

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 16
osd max scrubs = 16

On Tue, 25 Sep 2018 at 13:37, the user at <admin@xxxxxxxxxxxxxxx> wrote:

Just let it recover.

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.95 M objects, 17 TiB
    usage:   34 TiB used, 577 TiB / 611 TiB avail
    pgs:     94.873% pgs not active
             48475/17901254 objects degraded (0.271%)
             1/8950627 objects unfound (0.000%)
             2631 peering
             637  activating
             562  down
             159  active+clean
             44   activating+degraded
             30   active+recovery_wait+degraded
             12   activating+undersized+degraded
             10   active+recovering+degraded
             10   active+undersized+degraded
             1    active+clean+scrubbing+deep

You've got deep-scrubbing PGs, which put considerable IO load on the OSDs.
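For reference, pausing scrubs cluster-wide while recovery runs is usually done with the standard flags (remember to unset them once the cluster is healthy again):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # later, after recovery:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub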
September 25, 2018 1:23 PM, "by morphin" <morphinwithyou@xxxxxxxxx> wrote:

What should I do now?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com