Hello,

As far as I can tell, you're already doing everything possible to make the OSD rebuild/replacement as low-impact as it can be. If your cluster is still strongly, adversely affected by this gradual and throttled approach, consider the following questions:

1. Does scrub or deep_scrub also impact your performance enough for your applications to notice?
2. Are there times when other cluster activity (reboots, installs of new VMs, other large data movements created by clients) impacts your applications?

If either or both of these are true, your cluster is at the limit of its capacity. In general, a rebuild throttled with parameters like yours (and many others', including mine) should not hurt things. If it does, it's time to improve your cluster's performance:

1. Add journal SSDs, if not present already.
2. Add more OSDs in general.
3. Add a cache tier; this is particularly effective if your latency-sensitive applications do small writes or reads that easily fit into the cache. I was in a similar situation with hundreds of VMs running an application that did latency-sensitive small writes, and adding a cache tier completely solved the problem.

Regards,

Christian

On Tue, 10 May 2016 16:30:00 -0300 Agustín Trolli wrote:

> Hello All,
> I'm writing to you because I'm trying to find a way to rebuild an OSD
> disk without impacting the performance of the cluster.
> That's because my applications are very latency-sensitive.
>
> 1_ I found a way to reuse an OSD ID so I don't rebalance the cluster
> every time I lose a disk.
> So, my cluster is running with the noout flag set permanently.
> The point here is to do the disk change as fast as I can.
>
> 2_ After reusing the OSD ID, I'm leaving the OSD up and running, but with
> ZERO weight.
> For example:
>
> root@DC4-ceph03-dn03:/var/lib/ceph/osd/ceph-352# ceph osd tree | grep 352
> 352 1.81999 osd.352 up 0 1.00000
>
> At this point everything is good.
>
> 3_ Starting the reweight, using "ceph osd reweight" I'm not touching the
> crush map, and I'm doing the reweight very gradually.
> Example:
> ceph osd reweight 352 0.001
>
> But even doing the reweight this way, I'm sometimes hurting the latency.
> The more PGs the cluster is recovering, the worse the impact is.
>
> Tunings that I have already applied:
>
> ceph tell osd.* injectargs "--osd_max_backfills 1"
> ceph tell osd.* injectargs "--osd_recovery_max_active 1"
> ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
> ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'
>
> The question is: are there more parameters I can change to make the
> OSD rebuild even more gradual?
>
> I really appreciate your help, thanks in advance.
>
> Agustin Trolli
> Storage Team
> Mercadolibre.com

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
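
[Editor's note] For reference, below is a minimal bash sketch of the gradual reweight approach described in the thread, not something posted by either participant. The OSD id 352 is taken from the example above; the step values, the 60-second poll interval, and the assumption that `ceph health` mentions recovery/backfill activity while data is moving are all illustrative.

    #!/bin/bash
    # Sketch: raise the reweight of one OSD in small steps, waiting for
    # recovery/backfill to settle between steps. Assumes a working `ceph` CLI.

    OSD=352                          # OSD id from the example in the thread
    STEPS="0.1 0.25 0.5 0.75 1.0"    # illustrative step values

    for w in $STEPS; do
        echo "reweighting osd.${OSD} to ${w}"
        ceph osd reweight "${OSD}" "${w}"

        # Poll until no recovery/backfill activity is reported, then take
        # the next step. The grep pattern is an assumption about the
        # health summary wording on this Ceph release.
        while ceph health 2>/dev/null | grep -Eqi 'recover|backfill'; do
            sleep 60
        done
    done

This only automates the pacing already being done by hand with "ceph osd reweight"; it does not change any of the recovery throttling parameters listed above.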