Hello all,

I have set up a Ceph cluster consisting of one monitor, 32 OSD hosts (one 320 GB OSD per host), and 16 clients that read from and write to the cluster. I have one erasure-coded pool (shec plugin) with k=8, m=4, c=3 and pg_num=256; the failure domain is host. I am able to reach HEALTH_OK and everything works as expected. The pool was populated with 114048 files of different sizes ranging from 1 kB to 4 GB. The total amount of data in the pool was around 3 TB, and the capacity of the pool was around 10 TB.

I want to evaluate how Ceph rebalances data when 1) I take two OSDs out and 2) I rejoin those two OSDs. For scenario 1) I "kill" two OSDs via ceph osd out <osd-id>. Ceph notices the failure and starts to rebalance data until the cluster reaches HEALTH_OK again. For scenario 2) I rejoin the previously killed OSDs via ceph osd in <osd-id>. Again, Ceph notices the change and rebalances data until it reaches HEALTH_OK. I repeated this whole scenario four times.

What I notice is that rebalancing after the two OSDs rejoin the cluster takes more than three times as long as rebalancing after the loss of the two OSDs. This was consistent over the four runs. I expected both recovery times to be roughly equal, since in both scenarios the number of degraded objects was around 8% and the number of missing objects around 2%. I attached a visualization of the recovery process in terms of degraded and missing objects; the first part is the scenario where the two OSDs "failed", the second is the rejoining of those two OSDs. Note how it takes significantly longer to recover in the second case.

Now I want to understand why it takes longer! I appreciate all hints. Thanks!
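In case the exact procedure matters, this is roughly how I drive each run. The profile name, pool name, and OSD ids below are placeholders, not the exact ones from my setup; the shec parameters and pg_num match what I described above:

  # erasure-coded pool built from a shec profile (names are illustrative)
  ceph osd erasure-code-profile set shec-8-4-3 plugin=shec k=8 m=4 c=3 crush-failure-domain=host
  ceph osd pool create ecpool 256 256 erasure shec-8-4-3

  # scenario 1: take two OSDs out and time the recovery back to HEALTH_OK
  date; ceph osd out 12; ceph osd out 27
  while ! ceph health | grep -q HEALTH_OK; do sleep 10; done; date

  # scenario 2: bring the same OSDs back in and time the recovery again
  date; ceph osd in 12; ceph osd in 27
  while ! ceph health | grep -q HEALTH_OK; do sleep 10; done; date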