Just to add: assuming other settings are at their defaults, IOPS and maximum physical write speed are probably not the actual limiting factors in the tests you have been running. By default, Ceph throttles recovery I/O on any given OSD quite a bit to ensure that recovery operations don't adversely impact client I/O too much. You can experiment with the osd_max_backfills, osd_recovery_max_active and osd_recovery_sleep[_hdd,_ssd,_hybrid] family of settings to tune recovery speed (a rough example of adjusting these at runtime is below the quoted thread). You can probably make recovery a lot faster, but you will probably still see the discrepancy; fundamentally you are still comparing 30 workers shuffling around ~10% of your data with 2 workers taking on that same ~10% by themselves.

Rich

On 27/09/17 14:23, David Turner wrote:
> When you lose 2 OSDs, you have 30 OSDs accepting the degraded data and performing the backfilling. When the 2 OSDs are added back in, only those 2 OSDs receive the majority of the backfill data. Two OSDs have far less IOPS and spindle speed available than the other 30 did when they were recovering from the loss, and that is your bottleneck.
>
> Adding OSDs is generally a slower operation than losing them because of this, even for brand-new nodes that increase your cluster size.
>
> On Wed, Sep 27, 2017, 8:43 AM Jonas Jaszkowic <jonasjaszkowic.work@xxxxxxxxx> wrote:
>
> Hello all,
>
> I have set up a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD of 320 GB per host) and 16 clients which are reading from and writing to the cluster. I have one erasure-coded pool (shec plugin) with k=8, m=4, c=3 and pg_num=256; the failure domain is host. I am able to reach a HEALTH_OK state and everything is working as expected. The pool was populated with 114048 files of different sizes ranging from 1 kB to 4 GB. The total amount of data in the pool was around 3 TB; the capacity of the pool was around 10 TB.
>
> I want to evaluate how Ceph rebalances data when
>
> 1) I take out two OSDs, and
> 2) I rejoin these two OSDs.
>
> For scenario 1) I am "killing" two OSDs via *ceph osd out <osd-id>*. Ceph notices this failure and starts to rebalance data until I reach HEALTH_OK again.
>
> For scenario 2) I am rejoining the previously killed OSDs via *ceph osd in <osd-id>*. Again, Ceph notices the change and starts to rebalance data until it reaches the HEALTH_OK state.
>
> I repeated this whole scenario four times. *What I am noticing is that the rebalancing process in the event of two OSDs joining the cluster takes more than 3 times longer than in the event of the loss of two OSDs.* This was consistent over the four runs.
>
> I expected both recovery times to be equally long, since in both scenarios the number of degraded objects was around 8% and the number of missing objects around 2%. I attached a visualization of the recovery process in terms of degraded and missing objects; the first part is the scenario where the two OSDs "failed", the second one is the rejoining of these two OSDs. Note how it takes significantly longer to recover in the second case.
>
> Now I want to understand why it takes longer! I appreciate all hints.
>
> Thanks!
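As a rough, untested sketch of the tuning mentioned at the top of this mail: the recovery throttles can be loosened at runtime with injectargs. The values below are only illustrative, not recommendations, and the per-device-class sleep options only exist on recent releases; injectargs changes also don't persist across OSD restarts, so put anything you want to keep into ceph.conf as well:

    ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8 --osd-recovery-sleep-hdd 0'

For a sense of scale, very roughly and ignoring erasure-coding overhead: ~8% of the ~3 TB in the pool is on the order of 240 GB of degraded data. Spread over the 30 surviving OSDs that is about 8 GB written per OSD, but when the two OSDs rejoin, roughly the same amount has to land on just those two, i.e. something like 120 GB per OSD, which is why they remain the bottleneck even with the throttles opened up.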
--
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development