1. The first scenario is the only 4-node scenario, and since it uses
chassis-level replication, the single node remaining on the chassis takes
all the recovery traffic. That seems to be the bottleneck, as with host-level
replication on a similar setup the recovery time is much lower (that data is
not in this table). A sketch of the two CRUSH failure-domain choices is
included after this list.
2. In the second scenario, I kept everything else the same but doubled the
nodes per chassis. Recovery time also came down by half.
3. For the third scenario, I increased the cluster data and also doubled the
number of OSDs in the cluster (since each drive is 4TB now). Recovery time
came down further.
4. Moved to Jewel, keeping everything else the same, and got a further
improvement, mostly because of the improved write performance in Jewel (?).
5. The last scenario is interesting. With WPQ I got better recovery speed
than in any other scenario: degraded PG % came down to 2% within 3 hours and
to ~0.6% within 4 hours 15 minutes, but the *last 0.6% took ~4 hours*,
hurting the overall recovery time. (The WPQ setting is sketched after this
list.)
6. In fact, this long-tail latency is hurting the overall recovery time in
every other scenario as well. The related tracker I found is
http://tracker.ceph.com/issues/15763
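
For reference on the chassis-level vs. host-level comparison in point 1, a
minimal sketch of the two CRUSH rule variants is below (rule names and
ruleset numbers are placeholders, not taken from the actual test setup):

    # replicate across chassis: one OSD is chosen per chassis, so when a
    # node fails its PGs are remapped onto the remaining node(s) in the
    # same chassis
    rule replicated_chassis {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type chassis
            step emit
    }

    # replicate across hosts: recovery can spread over all surviving hosts
    rule replicated_host {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

With the chassis rule and only one node left in a chassis, all of that
chassis's recovery writes funnel through that single node, which matches the
bottleneck described in point 1.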
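
And for point 5, WPQ refers to the weighted priority op queue that Jewel
exposes through the osd op queue option. A minimal ceph.conf sketch (the
OSDs need a restart to pick this up; the cut off line is optional and not
necessarily what was used in this test):

    [osd]
    # switch from the default 'prio' queue to the weighted priority queue
    osd op queue = wpq
    # optional: send most ops through the one queue so client and recovery
    # ops are weighted against each other
    osd op queue cut off = high
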
Any feedback much appreciated. We can discuss this in tomorrow’s
performance call if needed.
Hi Somnath,
Thanks for these! Interesting results. Did you have a client load going
at the same time as recovery? It would be interesting to know how client IO
performance was affected in each case. Too bad about the long tail on WPQ. I
wonder whether the long tail is consistently higher with WPQ or whether it
just happened to be higher in that test.
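On the client load question, one simple way to gauge the impact would be to
keep a sustained rados bench write running against a test pool while
recovery is in progress and watch the client vs. recovery IO rates; the pool
name, runtime and thread count below are just placeholders:

    # sustained write load (4MB objects by default) against a test pool
    rados bench -p testpool 3600 write -t 16 --no-cleanup
    # elsewhere: client io, recovery io and degraded % roughly every 10s
    watch -n 10 ceph -s
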
Anyway, thanks for the results! Glad to see the recovery time in general
is lower in Jewel.
Mark