Reducing the recovery parameter values did not change much. A lot of
OSDs are still marked down, and I don't know what I need to do after
this point.

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1

ceph -s
  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            42 osds down
            1 host (6 osds) down
            61/8948582 objects unfound (0.001%)
            Reduced data availability: 3837 pgs inactive, 1822 pgs down,
            1900 pgs peering, 6 pgs stale
            Possible data damage: 18 pgs recovery_unfound
            Degraded data redundancy: 457246/17897164 objects degraded
            (2.555%), 213 pgs degraded, 209 pgs undersized
            2554 slow requests are blocked > 32 sec
            3273 slow ops, oldest one blocked for 1453 sec, daemons
            [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
            have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3, SRV-SEKUARK4
    osd: 168 osds: 118 up, 160 in

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.95 M objects, 17 TiB
    usage:   33 TiB used, 553 TiB / 586 TiB avail
    pgs:     93.677% pgs not active
             457246/17897164 objects degraded (2.555%)
             61/8948582 objects unfound (0.001%)
             1676 down
             1372 peering
             528  stale+peering
             164  active+undersized+degraded
             145  stale+down
             73   activating
             40   active+clean
             29   stale+activating
             17   active+recovery_unfound+undersized+degraded
             16   stale+active+clean
             16   stale+active+undersized+degraded
             9    activating+undersized+degraded
             3    active+recovery_wait+degraded
             2    activating+undersized
             2    activating+degraded
             1    creating+down
             1    stale+active+recovery_unfound+undersized+degraded
             1    stale+active+clean+scrubbing+deep
             1    stale+active+recovery_wait+degraded

ceph -w: https://paste.ubuntu.com/p/WZ2YqzS86S/
ceph health detail: https://paste.ubuntu.com/p/8w7Jpms8fj/
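(A minimal sketch of the usual first-aid commands for this kind of OSD
flapping, assuming a stock Mimic ceph CLI; the flags and option names
below are standard Ceph, but the values are only examples, not something
taken from this thread:)

# Keep restarting OSDs from being marked down/out while they come back
# (standard cluster flags; unset them once the OSDs stay up):
ceph osd set nodown
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover

# Apply throttled recovery settings at runtime instead of editing
# ceph.conf and restarting every OSD (example values only):
ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_max_backfills 1'

# Watch peering progress, then remove the flags once all OSDs stay up:
ceph -w
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nodown
ceph osd unset noout

The idea is to stop the OSD map churn first; recovery throttles only
start to matter once the OSDs stay up and the PGs can finish peering.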
by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018 at 14:32:
>
> The config didn't work, because increasing the numbers led to even more
> OSD drops.
>
> bhfs -s
>   cluster:
>     id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
>     health: HEALTH_ERR
>             norebalance,norecover flag(s) set
>             1 osds down
>             17/8839434 objects unfound (0.000%)
>             Reduced data availability: 3578 pgs inactive, 861 pgs
>             down, 1928 pgs peering, 11 pgs stale
>             Degraded data redundancy: 44853/17678868 objects degraded
>             (0.254%), 221 pgs degraded, 20 pgs undersized
>             610 slow requests are blocked > 32 sec
>             3996 stuck requests are blocked > 4096 sec
>             6076 slow ops, oldest one blocked for 4129 sec, daemons
>             [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
>             have slow ops.
>
>   services:
>     mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
>     mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
>     osd: 168 osds: 128 up, 129 in; 2 remapped pgs
>          flags norebalance,norecover
>
>   data:
>     pools:   1 pools, 4096 pgs
>     objects: 8.84 M objects, 17 TiB
>     usage:   26 TiB used, 450 TiB / 477 TiB avail
>     pgs:     0.024% pgs unknown
>              89.160% pgs not active
>              44853/17678868 objects degraded (0.254%)
>              17/8839434 objects unfound (0.000%)
>              1612 peering
>              720  down
>              583  activating
>              319  stale+peering
>              255  active+clean
>              157  stale+activating
>              108  stale+down
>              95   activating+degraded
>              84   stale+active+clean
>              50   active+recovery_wait+degraded
>              29   creating+down
>              23   stale+activating+degraded
>              18   stale+active+recovery_wait+degraded
>              14   active+undersized+degraded
>              12   active+recovering+degraded
>              4    stale+creating+down
>              3    stale+active+recovering+degraded
>              3    stale+active+undersized+degraded
>              2    stale
>              1    active+recovery_wait+undersized+degraded
>              1    active+clean+scrubbing+deep
>              1    unknown
>              1    active+undersized+degraded+remapped+backfilling
>              1    active+recovering+undersized+degraded
>
> I guess the OSD down/drop issue is what increases the recovery time, so
> I decided to try decreasing the recovery parameters to put less load on
> the cluster.
> I have NVMe and SAS disks, the servers are powerful enough, and the
> network is 4x10Gb.
> I don't think my cluster is in bad shape, because I have datacenter
> redundancy (14 servers + 14 servers). The 7 crashed servers are all in
> datacenter A, and it took only a few minutes to bring them back online.
> Also, 2 of them are monitors, so cluster I/O should have been suspended
> and there should be little data difference between the sides.
>
> On the other hand, I don't understand the burden of this recovery. I
> have been through many recoveries, but none of them stopped my cluster
> from working. This recovery load is so heavy that it hasn't settled for
> hours. I wish I could just decrease the recovery speed and continue to
> serve my VMs. Is the recovery load somehow different in Mimic?
> Luminous was pretty fine indeed.
>
> by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018 at 13:57:
> >
> > Thank you for the answer.
> >
> > What do you think of this conf to speed up the recovery?
> >
> > [osd]
> > osd recovery op priority = 63
> > osd client op priority = 1
> > osd recovery max active = 16
> > osd max scrubs = 16
> >
> > The user <admin@xxxxxxxxxxxxxxx> wrote on Tue, 25 Sep 2018 at 13:37:
> > >
> > > Just let it recover.
> > >
> > >   data:
> > >     pools:   1 pools, 4096 pgs
> > >     objects: 8.95 M objects, 17 TiB
> > >     usage:   34 TiB used, 577 TiB / 611 TiB avail
> > >     pgs:     94.873% pgs not active
> > >              48475/17901254 objects degraded (0.271%)
> > >              1/8950627 objects unfound (0.000%)
> > >              2631 peering
> > >              637  activating
> > >              562  down
> > >              159  active+clean
> > >              44   activating+degraded
> > >              30   active+recovery_wait+degraded
> > >              12   activating+undersized+degraded
> > >              10   active+recovering+degraded
> > >              10   active+undersized+degraded
> > >              1    active+clean+scrubbing+deep
> > >
> > > You also have PGs being deep-scrubbed, which puts considerable I/O
> > > load on the OSDs.
> > >
> > > September 25, 2018 1:23 PM, "by morphin" <morphinwithyou@xxxxxxxxx> wrote:
> > >
> > > > What should I do now?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
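(Likewise, a rough sketch of how the recovery_unfound PGs reported above
are usually inspected, again assuming the stock ceph CLI; <pgid> is a
placeholder for a PG id taken from ceph health detail:)

# List PGs stuck inactive and pick one to inspect:
ceph pg dump_stuck inactive

# The recovery_state section of a PG query shows what the PG is waiting
# for, e.g. which down OSDs it still wants to probe. <pgid> is a placeholder.
ceph pg <pgid> query

# Only as a last resort, once the OSDs holding the unfound objects are
# confirmed permanently lost, unfound objects can be reverted to prior copies:
# ceph pg <pgid> mark_unfound_lost revert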