> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: 11 May 2016 13:16
> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>;
> Ben England <bengland@xxxxxxxxxx>; Kyle Bader <kbader@xxxxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; Samuel Just <sjust@xxxxxxxxxx>;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Weighted Priority Queue testing
>
> > 1. First scenario: only 4 nodes, and since replication is at the
> > chassis level, the single node remaining in the chassis takes all the
> > traffic. That seems to be the bottleneck, as host-level replication on
> > a similar setup gives a much shorter recovery time (data not in this
> > table).
> >
> > 2. In the second scenario I kept everything else the same but doubled
> > the nodes/chassis. Recovery time also halved.
> >
> > 3. For the third scenario, I increased the cluster data and also
> > doubled the number of OSDs in the cluster (each drive is now 4TB).
> > Recovery time came down further.
> >
> > 4. Moved to Jewel keeping everything else the same and got a further
> > improvement, mostly because of the improved write performance in
> > Jewel (?).
> >
> > 5. The last scenario is interesting. With WPQ I got better recovery
> > speed than in any other scenario. Degraded PG % came down to 2% within
> > 3 hours and ~0.6% within 4 hours 15 min, but *the last 0.6% took ~4
> > hours*, hurting the overall recovery time.
> >
> > 6. In fact, this long tail is hurting the overall recovery time in
> > every other scenario as well. A related tracker I found is
> > http://tracker.ceph.com/issues/15763
> >
> > Any feedback much appreciated. We can discuss this in tomorrow's
> > performance call if needed.
>
> Hi Somnath,
>
> Thanks for these! Interesting results. Did you have a client load going
> at the same time as recovery? It would be interesting to know how client
> IO performance was affected in each case. Too bad about the long tail on
> WPQ. I wonder if the long tail is consistently higher with WPQ or if it
> just happened to be higher in that test.
>
> Anyway, thanks for the results! Glad to see the recovery time in general
> is lower in hammer.

I've also been running with the weighted queue for a week, but testing
more from a stability point of view than performance. I've taken a few
OSDs out and let the cluster recover, and I haven't seen any negative
effects on our normal workloads.

> Mark
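
For anyone reproducing the chassis-level replication from scenario 1, a
minimal CRUSH rule sketch (Jewel-era syntax; this assumes a 'chassis'
bucket level is already populated in the CRUSH map, and the ruleset
number is arbitrary):

    # place each replica in a different chassis
    rule chassis_replicated {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type chassis
            step emit
    }

The edited map would be compiled with crushtool and injected with
'ceph osd setcrushmap -i <map>'; swapping 'chassis' for 'host' in the
chooseleaf step gives the host-level variant that scenario 1 is
compared against.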
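
For reference, the weighted priority queue under test is selected per
OSD via the op queue option; a sketch of the ceph.conf stanza, assuming
a build that ships the wpq implementation (OSDs need a restart to pick
it up):

    [osd]
    # switch the op/recovery queue from the default 'prio' to 'wpq'
    osd op queue = wpq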
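
To capture the long tail Somnath describes in scenario 5, a simple
sampling loop over 'ceph -s' is enough (a sketch; the log file name is
arbitrary):

    # record the degraded-object line once a minute
    while true; do
        echo "$(date -u +%FT%TZ) $(ceph -s | grep -i degraded)" >> degraded.log
        sleep 60
    done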
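
On Mark's question about concurrent client IO, one way to keep a
steady, measurable client load running through the whole recovery is
rados bench against a test pool (a sketch; the pool name 'bench' is
hypothetical):

    # 16 concurrent 4MB object writes for an hour; --no-cleanup keeps
    # the objects so a read phase can follow
    rados bench -p bench 3600 write -t 16 --no-cleanup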