Re: Weighted Priority Queue testing

Hello again,

On Fri, 13 May 2016 14:17:22 +0000 Somnath Roy wrote:

> Thanks Christian for the input.
> I will start digging the code and look for possible explanation.
> 

To be fair, after a while more PGs become involved, up to a backfill
count of 18 (that's 9 actual backfill operations, as the count includes
both reads and writes). But the last OSD of the 6 new ones didn't see any
action until nearly 3 hours into the process.

As they say, a picture is worth a thousand words; this is the primary PG
distribution during that backfill operation:

https://i.imgur.com/mp6yUW7.png

The starting point is a 3-node cluster: one node with 2 large OSDs (to be
replaced with a 6-OSD node later), one node with 6 OSDs, and another
6-OSD node with all of its OSDs set to a crush weight of 0.

Note that the 6-OSD nodes have osd_max_backfills set to 4, while the node
with the 2 large OSDs is at 1. This of course explains some of the
behavior seen here, but by no means all of it.
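
For illustration, here is a rough back-of-the-envelope sketch in Python of
the concurrency ceiling those limits would allow if every backfill slot
were filled. The slot model (one read slot on the source OSD, one write
slot on the target OSD, both capped by osd_max_backfills) is my own
simplification, not Ceph's actual scheduler, and it assumes reads come
only from the existing OSDs and writes only go to the new ones:

# Naive upper bound on concurrent backfills for the cluster described
# above, assuming each backfill occupies one slot on its source (read)
# OSD and one slot on its target (write) OSD.
source_slots = 2 * 1 + 6 * 4   # 2 large OSDs at max_backfills=1, 6 small at 4
target_slots = 6 * 4           # 6 new OSDs at max_backfills=4

# Even a perfect scheduler could not exceed the smaller of the two
# totals, which lines up with the "24 backfills in theory" further down.
print(min(source_slots, target_slots))   # -> 24

Against that, the observed peak of 9 actual backfills falls well short,
which is why the per-node limits only explain part of the picture.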

So at 15:00 I set the crush weight of all 6 new OSDs to 5 (the same as
all the others).
As you can see, the first OSD (in order of weight change) starts growing
right away, the second shortly after that.

But it takes 20 minutes until the 3rd OSD sees some action, 1 hour for
the 4th, nearly 2 hours for the 5th, and, as mentioned, nearly 3 hours
for the 6th and last one.

Again, some of this can be explained by the max backfill of 1 on the 2
large OSDs, but even those were idle for about 20% of the time and never
should have been.
And the 6 smaller existing OSDs should have seen up to 4 backfills each
(mostly reads), but never did.

So to recap: things happen sequentially when they should be randomized
and optimized.

My idea of how this _should_ work (and clearly doesn't) would be:

Iterate over all PGs with pending backfill ops (optionally starting each
loop at a random point, a la the Exim queue runner), find a target
(write) OSD that is below its max backfill limit, then match it with a
source (read) OSD that also has enough backfill credits.

The matching is the important bit: if there isn't a source OSD available
for one of the waiting backfills on the target OSD, try the next source
OSD; if all source OSDs for that target OSD are busy, move on to the next
target OSD.

This should get things going at full speed right from the start.
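
To make that concrete, below is a minimal sketch in Python of the
matching loop I have in mind. It is purely illustrative pseudo-code, not
Ceph code; the data shapes (a per-target list of waiting backfills with
candidate source OSDs, and a per-OSD count of backfills already in
flight) are assumptions made for the sketch.

import random
from collections import defaultdict

def schedule_backfills(pending, max_backfills, in_flight=None):
    """Greedy one-pass matcher for the scheme described above.

    pending       -- dict: target_osd -> list of (pg_id, candidate source OSDs)
    max_backfills -- dict: osd_id -> allowed concurrent backfills
    in_flight     -- optional dict: osd_id -> backfills already running
    """
    used = defaultdict(int, in_flight or {})
    started = []

    # Walk the target OSDs in random order (a la the Exim queue runner)
    # so the same OSDs don't always win ties.
    targets = list(pending)
    random.shuffle(targets)

    for dst in targets:
        for pg_id, sources in pending[dst]:
            # Target (write) OSD must still be below its limit.
            if used[dst] >= max_backfills.get(dst, 1):
                break   # this target is full -> next target OSD
            # Find a source (read) OSD with a free backfill credit.
            src = next((s for s in sources
                        if used[s] < max_backfills.get(s, 1)), None)
            if src is None:
                continue    # all sources busy for this PG -> try the next PG
            used[src] += 1
            used[dst] += 1
            started.append((pg_id, src, dst))

    return started

Run once per scheduling pass, something like this would keep every source
and target OSD at its configured limit from the first minute, instead of
working through the new OSDs one after the other.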

After that one could think about optimizing the above with weighted
priorities and buckets (prioritize the bucket of the OSD with the most
target PGs).
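
In terms of the sketch above, that weighted flavor would only change how
the targets are ordered, e.g. sorting by the number of waiting PGs
instead of shuffling (again purely illustrative):

# Prioritize the target OSDs with the most waiting PGs, as a stand-in
# for "the bucket of the OSD with the most target PGs".
targets = sorted(pending, key=lambda dst: len(pending[dst]), reverse=True)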

Regards,

Christian

> Regards
> Somnath
> 
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: Thursday, May 12, 2016 11:52 PM
> To: Somnath Roy
> Cc: Scottix; ceph-users@xxxxxxxxxxxxxx; Nick Fisk
> Subject: Re:  Weighted Priority Queue testing
> 
> 
> Hello,
> 
> On Fri, 13 May 2016 05:46:41 +0000 Somnath Roy wrote:
> 
> > FYI in my test I used osd_max_backfills = 10 which is hammer default.
> > Post hammer it's been changed to 1.
> >
> All my tests and experiences are with Firefly and Hammer.
> 
> Also FYI and possibly pertinent to this discussion, I just added a node
> with 6 OSDs to one of my clusters. I did this by initially adding things
> with a crush weight of 0 (so nothing happened) and then in one fell
> swoop set the weights of all those OSDs to 5.
> 
> Now what I'm seeing (and remember seeing before) is that Ceph is
> processing this very sequentially, meaning it is currently backfilling
> the first 2 OSDs and doing nothing of the sort with the other 4; they
> are idle.
> 
> "osd_max_backfills" is set to 4, which is incidentally the number of
> backfills happening on the new node now, however this is per OSD, so in
> theory we could expect 24 backfills. The prospective source OSDs aren't
> pegged with backfills either, they have 1-2 going on.
> 
> I'm seriously wondering if this behavior is related to what we're
> talking about here.
> 
> Christian
> 
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Christian Balzer [mailto:chibi@xxxxxxx]
> > Sent: Thursday, May 12, 2016 10:40 PM
> > To: Scottix
> > Cc: Somnath Roy; ceph-users@xxxxxxxxxxxxxx; Nick Fisk
> > Subject: Re:  Weighted Priority Queue testing
> >
> >
> > Hello,
> >
> > On Thu, 12 May 2016 15:41:13 +0000 Scottix wrote:
> >
> > > We have run into this same scenario in terms of the long tail
> > > taking much longer on recovery than the initial part.
> > >
> > > Either time we are adding OSDs or an OSD gets taken down. At first
> > > we have max-backfill set to 1 so it doesn't kill the cluster with
> > > IO. As time passes by, a single OSD is performing the backfill. So
> > > we gradually increase max-backfill up to 10 to reduce the amount of
> > > time it needs to recover fully. I know there are a few other
> > > factors at play here, but for us we tend to do this procedure
> > > every time.
> > >
> >
> > Yeah, as I wrote in my original mail "This becomes even more obvious
> > when backfills and recovery settings are lowered".
> >
> > However my test cluster is at the default values, so it starts with a
> > (much too big) bang and ends with a whimper, not because it's
> > throttled but simply because there are so few PGs/OSDs to choose from.
> > Or so it seems, purely from observation.
> >
> > Christian
> > > On Wed, May 11, 2016 at 6:29 PM Christian Balzer <chibi@xxxxxxx>
> > > wrote:
> > >
> > > > On Wed, 11 May 2016 16:10:06 +0000 Somnath Roy wrote:
> > > >
> > > > > I bumped up the backfill/recovery settings to match up Hammer.
> > > > > It is probably unlikely that long tail latency is a parallelism
> > > > > issue. If so, entire recovery would be suffering not the tail
> > > > > alone. It's probably a prioritization issue. Will start looking
> > > > > and update my findings. I can't add devl because of the table
> > > > > but needed to add community that's why ceph-users :-).. Also,
> > > > > wanted to know from Ceph's user if they are also facing similar
> > > > > issues..
> > > > >
> > > >
> > > > What I meant with lack of parallelism is that at the start of a
> > > > rebuild, there are likely to be many candidate PGs for recovery
> > > > and backfilling, so many things happen at the same time, up to the
> > > > limits of what is configured (max backfill etc).
> > > >
> > > > From looking at my test cluster, it starts with 8-10 backfills and
> > > > recoveries (out of 140 affected PGs), but later on in the game
> > > > there are less and less PGs (and OSDs/nodes) to choose from, so
> > > > things slow down around 60 PGs to just 3-4 backfills.
> > > > And around 20 PGs it's down to 1-2 backfills, so the parallelism
> > > > is clearly gone at that point and recovery speed is down to what a
> > > > single PG/OSD can handle.
> > > >
> > > > Christian
> > > >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > >
> > > > > -----Original Message-----
> > > > > From: Christian Balzer [mailto:chibi@xxxxxxx]
> > > > > Sent: Wednesday, May 11, 2016 12:31 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; Nick Fisk; ceph-users@xxxxxxxxxxxxxx
> > > > > Subject: Re:  Weighted Priority Queue testing
> > > > >
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > not sure if the Cc: to the users ML was intentional or not, but
> > > > > either way.
> > > > >
> > > > > The issue seen in the tracker:
> > > > > http://tracker.ceph.com/issues/15763
> > > > > and what you have seen (and I as well) feels a lot like the lack
> > > > > of parallelism towards the end of rebuilds.
> > > > >
> > > > > This becomes even more obvious when backfills and recovery
> > > > > settings are lowered.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Christian
> > > > > --
> > > > > Christian Balzer        Network/Systems Engineer
> > > > > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> > > > > http://www.gol.com/
> > > > >
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


