The config didn't work. Increasing the numbers only led to more OSD drops.

bhfs -s

  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            norebalance,norecover flag(s) set
            1 osds down
            17/8839434 objects unfound (0.000%)
            Reduced data availability: 3578 pgs inactive, 861 pgs down, 1928 pgs peering, 11 pgs stale
            Degraded data redundancy: 44853/17678868 objects degraded (0.254%), 221 pgs degraded, 20 pgs undersized
            610 slow requests are blocked > 32 sec
            3996 stuck requests are blocked > 4096 sec
            6076 slow ops, oldest one blocked for 4129 sec, daemons [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]... have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
    osd: 168 osds: 128 up, 129 in; 2 remapped pgs
         flags norebalance,norecover

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.84 M objects, 17 TiB
    usage:   26 TiB used, 450 TiB / 477 TiB avail
    pgs:     0.024% pgs unknown
             89.160% pgs not active
             44853/17678868 objects degraded (0.254%)
             17/8839434 objects unfound (0.000%)
             1612 peering
             720  down
             583  activating
             319  stale+peering
             255  active+clean
             157  stale+activating
             108  stale+down
             95   activating+degraded
             84   stale+active+clean
             50   active+recovery_wait+degraded
             29   creating+down
             23   stale+activating+degraded
             18   stale+active+recovery_wait+degraded
             14   active+undersized+degraded
             12   active+recovering+degraded
             4    stale+creating+down
             3    stale+active+recovering+degraded
             3    stale+active+undersized+degraded
             2    stale
             1    active+recovery_wait+undersized+degraded
             1    active+clean+scrubbing+deep
             1    unknown
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+degraded

I guess the OSD down/drop issue is what increases the recovery time, so I decided to try decreasing the recovery parameters instead, to put less load on the cluster. I have NVMe and SAS disks, the servers are powerful enough, and the network is 4x10Gb.

I don't think my cluster is in bad shape, because I have datacenter redundancy (14 servers + 14 servers). The 7 crashed servers were all in datacenter A, and it took only a few minutes to bring them back online. Also, 2 of them are monitors, so cluster I/O should have been suspended and there should be little data difference.

On the other hand, I don't understand the burden of this recovery. I have faced many recoveries, but none of them ever stopped my cluster from working. This recovery load is so high that it hasn't stopped for hours. I wish I could just decrease the recovery speed and continue serving my VMs. Is recovery load handled somewhat differently in Mimic? Luminous was pretty fine indeed.

On Tue, 25 Sep 2018 at 13:57, by morphin <morphinwithyou@xxxxxxxxx> wrote:
>
> Thank you for the answer.
>
> What do you think of this conf to speed up the recovery?
>
> [osd]
> osd recovery op priority = 63
> osd client op priority = 1
> osd recovery max active = 16
> osd max scrubs = 16
>
> On Tue, 25 Sep 2018 at 13:37, the user with the address <admin@xxxxxxxxxxxxxxx> wrote:
> >
> > Just let it recover.
> >
> > data:
> >     pools:   1 pools, 4096 pgs
> >     objects: 8.95 M objects, 17 TiB
> >     usage:   34 TiB used, 577 TiB / 611 TiB avail
> >     pgs:     94.873% pgs not active
> >              48475/17901254 objects degraded (0.271%)
> >              1/8950627 objects unfound (0.000%)
> >              2631 peering
> >              637  activating
> >              562  down
> >              159  active+clean
> >              44   activating+degraded
> >              30   active+recovery_wait+degraded
> >              12   activating+undersized+degraded
> >              10   active+recovering+degraded
> >              10   active+undersized+degraded
> >              1    active+clean+scrubbing+deep
> >
> > You've got deep-scrubbing PGs, which put considerable I/O load on the OSDs.
> >
> > September 25, 2018 1:23 PM, "by morphin" <morphinwithyou@xxxxxxxxx> wrote:
> >
> > > What should I do now?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
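
A minimal sketch of the "decrease the recovery parameters" approach morphin describes above, assuming a Mimic-era cluster. All of the options below are standard OSD settings, but the specific values are illustrative assumptions rather than anything confirmed in the thread, and would normally be relaxed again once the cluster is healthy:

    [osd]
    # Throttle recovery and backfill instead of boosting them,
    # so client I/O keeps the upper hand (illustrative values):
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery sleep = 0.1
    # Keep client ops at a much higher priority than recovery ops
    # (the inverse of the 63/1 split quoted above):
    osd recovery op priority = 1
    osd client op priority = 63
    # Back to the default of a single concurrent scrub per OSD:
    osd max scrubs = 1

The runtime-tunable throttles can also be applied to all OSDs without a restart, with something like:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'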