Just to add: assuming other settings are at their defaults, IOPS and maximum physical write speed are probably not the actual limiting factors in the tests you have been running. By default, Ceph throttles recovery I/O on any given OSD quite a bit to ensure that recovery operations don't adversely impact client I/O too much. You can experiment with the osd_max_backfills, osd_recovery_max_active and osd_recovery_sleep[_hdd,_ssd,_hybrid] family of settings to tune recovery speed (a rough example of adjusting these at runtime is below the quoted thread). You can probably make recovery a lot faster, but you will probably still see the discrepancy; fundamentally you are still comparing 30 workers shuffling around ~10% of your data with 2 workers taking on that same ~10% by themselves.

Rich

On 27/09/17 14:23, David Turner wrote:
> When you lose 2 OSDs, you have 30 OSDs accepting the degraded data and performing the backfilling. When the 2 OSDs are added back in, only those 2 OSDs receive the majority of the backfill data. Two OSDs have far less IOPS and spindle speed available than the other 30 did when they were recovering from the loss, and that is your bottleneck.
>
> Adding OSDs is generally a slower operation than losing them because of this, even for brand-new nodes that increase your cluster size.
>
> On Wed, Sep 27, 2017, 8:43 AM Jonas Jaszkowic <jonasjaszkowic.work@xxxxxxxxx> wrote:
>
> Hello all,
>
> I have set up a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD of 320 GB per host) and 16 clients which are reading from and writing to the cluster. I have one erasure-coded pool (shec plugin) with k=8, m=4, c=3 and pg_num=256; the failure domain is host. I am able to reach a HEALTH_OK state and everything is working as expected. The pool was populated with 114048 files of different sizes ranging from 1 kB to 4 GB. The total amount of data in the pool was around 3 TB; the capacity of the pool was around 10 TB.
>
> I want to evaluate how Ceph rebalances data when
>
> 1) I take out two OSDs, and
> 2) I rejoin these two OSDs.
>
> For scenario 1) I am "killing" two OSDs via *ceph osd out <osd-id>*. Ceph notices this failure and starts to rebalance data until I reach HEALTH_OK again.
>
> For scenario 2) I am rejoining the previously killed OSDs via *ceph osd in <osd-id>*. Again, Ceph notices the change and starts to rebalance data until it reaches the HEALTH_OK state.
>
> I repeated this whole scenario four times. *What I am noticing is that the rebalancing process in the event of two OSDs joining the cluster takes more than 3 times longer than in the event of the loss of two OSDs.* This was consistent over the four runs.
>
> I expected both recovery times to be equally long, since in both scenarios the number of degraded objects was around 8% and the number of missing objects around 2%. I attached a visualization of the recovery process in terms of degraded and missing objects; the first part is the scenario where the two OSDs "failed", the second one is the rejoining of these two OSDs. Note how it takes significantly longer to recover in the second case.
>
> Now I want to understand why it takes longer! I appreciate all hints.
>
> Thanks!
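As a rough, untested sketch of the tuning mentioned at the top of this mail: the recovery throttles can be loosened at runtime with injectargs. The values below are only illustrative, not recommendations, and the per-device-class sleep options only exist on recent releases; injectargs changes also don't persist across OSD restarts, so put anything you want to keep into ceph.conf as well:

    ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8 --osd-recovery-sleep-hdd 0'

For a sense of scale, very roughly and ignoring erasure-coding overhead: ~8% of the ~3 TB in the pool is on the order of 240 GB of degraded data. Spread over the 30 surviving OSDs that is about 8 GB written per OSD, but when the two OSDs rejoin, roughly the same amount has to land on just those two, i.e. something like 120 GB per OSD, which is why they remain the bottleneck even with the throttles opened up.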
--
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development