> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: 11 May 2016 13:16
> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>;
> Ben England <bengland@xxxxxxxxxx>; Kyle Bader <kbader@xxxxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; Samuel Just <sjust@xxxxxxxxxx>;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Weighted Priority Queue testing
>
> > 1. First scenario: only 4 nodes, and since replication is at the
> > chassis level, the single node remaining in the chassis takes all the
> > traffic. That seems to be the bottleneck, as host-level replication on
> > a similar setup gives a much shorter recovery time (data not in this
> > table).
> >
> > 2. In the second scenario I kept everything else the same but doubled
> > the nodes/chassis. Recovery time also halved.
> >
> > 3. For the third scenario, I increased the cluster data and also
> > doubled the number of OSDs in the cluster (each drive is now 4TB).
> > Recovery time came down further.
> >
> > 4. Moved to Jewel keeping everything else the same and got a further
> > improvement, mostly because of the improved write performance in
> > Jewel (?).
> >
> > 5. The last scenario is interesting. With WPQ I got better recovery
> > speed than in any other scenario. Degraded PG % came down to 2% within
> > 3 hours and ~0.6% within 4 hours 15 min, but *the last 0.6% took ~4
> > hours*, hurting the overall recovery time.
> >
> > 6. In fact, this long tail is hurting the overall recovery time in
> > every other scenario as well. A related tracker I found is
> > http://tracker.ceph.com/issues/15763
> >
> > Any feedback much appreciated. We can discuss this in tomorrow's
> > performance call if needed.
>
> Hi Somnath,
>
> Thanks for these! Interesting results. Did you have a client load going
> at the same time as recovery? It would be interesting to know how client
> IO performance was affected in each case. Too bad about the long tail on
> WPQ. I wonder if the long tail is consistently higher with WPQ or if it
> just happened to be higher in that test.
>
> Anyway, thanks for the results! Glad to see the recovery time in general
> is lower in hammer.

I've also been running with the weighted queue for a week, but testing
more from a stability point of view than performance. I've taken a few
OSDs out and let the cluster recover, and I haven't seen any negative
effects on our normal workloads.

> Mark
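
For anyone reproducing the chassis-level replication from scenario 1, a
minimal CRUSH rule sketch (Jewel-era syntax; this assumes a 'chassis'
bucket level is already populated in the CRUSH map, and the ruleset
number is arbitrary):

    # place each replica in a different chassis
    rule chassis_replicated {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type chassis
            step emit
    }

The edited map would be compiled with crushtool and injected with
'ceph osd setcrushmap -i <map>'; swapping 'chassis' for 'host' in the
chooseleaf step gives the host-level variant that scenario 1 is
compared against.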
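
For reference, the weighted priority queue under test is selected per
OSD via the op queue option; a sketch of the ceph.conf stanza, assuming
a build that ships the wpq implementation (OSDs need a restart to pick
it up):

    [osd]
    # switch the op/recovery queue from the default 'prio' to 'wpq'
    osd op queue = wpq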
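
To capture the long tail Somnath describes in scenario 5, a simple
sampling loop over 'ceph -s' is enough (a sketch; the log file name is
arbitrary):

    # record the degraded-object line once a minute
    while true; do
        echo "$(date -u +%FT%TZ) $(ceph -s | grep -i degraded)" >> degraded.log
        sleep 60
    done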
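
On Mark's question about concurrent client IO, one way to keep a
steady, measurable client load running through the whole recovery is
rados bench against a test pool (a sketch; the pool name 'bench' is
hypothetical):

    # 16 concurrent 4MB object writes for an hour; --no-cleanup keeps
    # the objects so a read phase can follow
    rados bench -p bench 3600 write -t 16 --no-cleanup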