1. The first scenario is the only 4-node scenario, and since it uses
chassis-level replication, the single node remaining on the chassis takes
all the recovery traffic. That seems to be the bottleneck, as with host-level
replication on a similar setup the recovery time is much lower (that data is
not in this table). A sketch of the two CRUSH failure-domain choices is
included after this list.
2. In the second scenario, I kept everything else the same but doubled the
nodes per chassis. Recovery time also came down by half.
3. For the third scenario, I increased the cluster data and also doubled the
number of OSDs in the cluster (since each drive is 4TB now). Recovery time
came down further.
4. Moved to Jewel, keeping everything else the same, and got a further
improvement, mostly because of the improved write performance in Jewel (?).
5. The last scenario is interesting. With WPQ I got better recovery speed
than in any other scenario: degraded PG % came down to 2% within 3 hours and
to ~0.6% within 4 hours 15 minutes, but the *last 0.6% took ~4 hours*,
hurting the overall recovery time. (The WPQ setting is sketched after this
list.)
6. In fact, this long-tail latency is hurting the overall recovery time in
every other scenario as well. The related tracker I found is
http://tracker.ceph.com/issues/15763
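
For reference on the chassis-level vs. host-level comparison in point 1, a
minimal sketch of the two CRUSH rule variants is below (rule names and
ruleset numbers are placeholders, not taken from the actual test setup):

    # replicate across chassis: one OSD is chosen per chassis, so when a
    # node fails its PGs are remapped onto the remaining node(s) in the
    # same chassis
    rule replicated_chassis {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type chassis
            step emit
    }

    # replicate across hosts: recovery can spread over all surviving hosts
    rule replicated_host {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

With the chassis rule and only one node left in a chassis, all of that
chassis's recovery writes funnel through that single node, which matches the
bottleneck described in point 1.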
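
And for point 5, WPQ refers to the weighted priority op queue that Jewel
exposes through the osd op queue option. A minimal ceph.conf sketch (the
OSDs need a restart to pick this up; the cut off line is optional and not
necessarily what was used in this test):

    [osd]
    # switch from the default 'prio' queue to the weighted priority queue
    osd op queue = wpq
    # optional: send most ops through the one queue so client and recovery
    # ops are weighted against each other
    osd op queue cut off = high
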
Any feedback much appreciated. We can discuss this in tomorrow’s
performance call if needed.
Hi Somnath,
Thanks for these! Interesting results. Did you have a client load going
at the same time as recovery? It would be interesting to know how client IO
performance was affected in each case. Too bad about the long tail on WPQ. I
wonder whether the long tail is consistently higher with WPQ or whether it
just happened to be higher in that test.
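On the client load question, one simple way to gauge the impact would be to
keep a sustained rados bench write running against a test pool while
recovery is in progress and watch the client vs. recovery IO rates; the pool
name, runtime and thread count below are just placeholders:

    # sustained write load (4MB objects by default) against a test pool
    rados bench -p testpool 3600 write -t 16 --no-cleanup
    # elsewhere: client io, recovery io and degraded % roughly every 10s
    watch -n 10 ceph -s
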
Anyway, thanks for the results! Glad to see the recovery time in general
is lower in Jewel.
Mark