Re: Advice on increasing pgs

Thanks for the clarification, Christian.  Good to know about the potential increase in OSD usage. As you said, given how much available capacity we have, we're betting on the distribution not getting much worse. But we'll look at re-weighting if things go sideways.

Cheers,
Robin

On Mon, Jul 11, 2016 at 11:07 PM Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Tue, 12 Jul 2016 03:43:41 +0000 Robin Percy wrote:

> First off, thanks for the great response David.
>
Yes, that was a very good writeup.

> If I understand correctly, you're saying there are two distinct costs to
> consider: peering, and backfilling. The backfilling cost is a function of
> the amount of data in our pool, and therefore won't benefit from
> incremental steps. But the peering cost is a function of pg_num, and should
> be incremented in steps of at most ~200 (depending on hardware) until we
> reach a power of 2.
>
Peering is all about RAM (more links, states, permanently so), CPU and
network (when setting up the links).
And this happens instantaneously, with no parameters in Ceph to slow this
down.

So yes, you want to increase the pg_num and pgp_num somewhat slowly, at
least at first until you have a feel for what your HW can handle.
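
A single step might look something like this (the pool name and target
number here are just placeholders, adjust them to your setup):

  ceph osd pool set <poolname> pg_num 256
  ceph osd pool set <poolname> pgp_num 256
  ceph -s    # wait for peering to settle before taking the next step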

> Assuming I've got that right, one follow up question is: should we expect
> blocked/delayed requests during both the peering and backfilling processes,
> or is it more common in one than the other? I couldn't quite get a
> definitive answer from the docs on peering.
>
Peering is a sharp shock; it should be quick to resolve (again, depending
on HW, etc.) and not lead to noticeable interruptions.
But YMMV, hence the initial baby steps.

Backfilling is that inevitable avalanche, but if you start with
osd_max_backfills=1 and then creep it up as you get a feel for what your
cluster can handle, you should be able to both keep slow requests at bay
AND hopefully finish within a reasonably sized maintenance window.
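
For example, injected at runtime (assuming you can reach all OSDs from an
admin node):

  ceph tell osd.* injectargs '--osd_max_backfills 1'
  # and later, once you see the cluster copes:
  ceph tell osd.* injectargs '--osd_max_backfills 3'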

Since you're still on Firefly, you won't be getting the queue improvements
of Jewel, which also help keep backfilling from stomping on the toes of
client traffic.

OTOH, you're currently only using a fraction of your cluster's capabilities
(64 PGs with 126 OSDs), so there should be quite a bit of capacity
available for this reshuffle.

> At this point we're planning to hedge our bets by increasing pg_num to 256
> before backfilling so we can at least buy some headroom on our full OSDs
> and evaluate the impact before deciding whether we can safely make the
> jumps to 2048 without an outage. If that doesn't make sense, I may be
> overestimating the cost of peering.
>
As David said, freeze your cluster (norecover, nobackfill, nodown and
noout), slowly up your pg_num and pgp_num, then let the good times roll and
unleash the dogs of backfill.
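
In command form, roughly:

  ceph osd set noout
  ceph osd set nodown
  ceph osd set norecover
  ceph osd set nobackfill
  # ...bump pg_num/pgp_num in steps until you reach the target, then:
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset nodown
  # and only once backfilling has finished:
  ceph osd unset noout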


What worries me the most in your scenario are the already
near-full OSDs.

As many people found out the hard way, Ceph may initially go and put MORE
data on OSDs before later distributing things more evenly.
See for example this mail from me and the image URL in it:
http://www.spinics.net/lists/ceph-users/msg27794.html
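
So keep a close eye on the fullest OSDs while the backfilling runs, e.g.:

  ceph health detail | grep -i full    # lists near-full/full OSDs with percentages
  ceph df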

Normally my advice would be to re-weight the full (or nearly empty) OSDs so
that things get a bit more evenly distributed and below near-full levels
before starting the PG increase.
But in your case with so few PGs to begin with, it's going to be tricky to
get it right and not make things worse.
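
For reference, that would be an override reweight per OSD, along these
lines (the OSD id and weight here are made-up examples):

  ceph osd reweight 12 0.85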

Hopefully the plentiful PG/OSD placement choices Ceph will have after the
PG increase will make it do the right thing from the get-go in your case.

Christian


> Thanks again for your help,
> Robin
>
>
> On Mon, Jul 11, 2016 at 2:40 PM David Turner <david.turner@xxxxxxxxxxxxxxxx>
> wrote:
>
> > When you increase your PGs you're already going to be moving around all of
> > your data.  Doing a full doubling of your PGs from 64 -> 128 -> 256 -> ...
> > -> 2048 over and over and letting it backfill to healthy every time is a
> > lot of extra data movement that isn't needed.
> >
> > I would recommend setting osd_max_backfills to something that won't
> > cripple your cluster (5 works decently for us), set the norecover,
> > nobackfill, nodown, and noout flags, and then increase your pg_num and
> > pgp_num slowly until you reach your target.  How much extra RAM you have in
> > each of your storage nodes determines how much you can increase pg_num by
> > at a time.  We don't do more than ~200 at a time.  When
> > you reach your target and there is no more peering happening, then unset
> > norecover, nobackfill, and nodown.  After you finish all of the
> > backfilling, then unset noout.
> >
> > You are likely to see slow/blocked requests in your cluster throughout
> > this process, but the best thing is to get to the other side of increasing
> > your pgs.  The official recommendation for increasing pgs is to plan ahead
> > for the size of your cluster and start with that many pgs because this
> > process is painful and will slow down your cluster until it's done.
> >
> > Note, if you're increasing pgs from 2048 to 4096, then doing it in smaller
> > chunks of 512 at a time could make sense because of how ceph treats pools
> > with a non-power-of-2 number of pgs.  If you have 8 pgs that are 4GB and
> > increase the number to 10 (a non-power of 2) then you will have 6 pgs that
> > are 4GB and 4 pgs that are 2GB.  It only splits as many pgs in half as it
> > needs to reach the new count.  If you went to 14 pgs, then you
> > would have 2 pgs that are 4GB and 12 pgs that are 2GB.  Finally when you
> > set it to 16 pgs you would have 16 pgs that are all 2GB.
> >
> > So if you increase your PGs by less than a full doubling, then it will only
> > split that many pgs and leave the rest of them alone.  However in
> > your scenario of going from 64 pgs to 2048, you are going to be affecting
> > all of the PGs every time you split and buy yourself nothing by doing it in
> > smaller chunks.  The reason not to jump straight to pg_num 2048 is that
> > when ceph creates each PG it has to peer, and you can peer your osds into
> > oblivion and lose access to all of your data for a while.  That's why the
> > recommendation is to add them bit by bit with nodown, noout, nobackfill, and
> > norecover set, so that you get to the number you want and can then tell your
> > cluster to start moving data.
> > ------------------------------
> > *From:* ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Robin
> > Percy [rpercy@xxxxxxxxx]
> > *Sent:* Monday, July 11, 2016 2:53 PM
> > *To:* ceph-users@xxxxxxxx
> > *Subject:* Advice on increasing pgs
> >
> > Hello,
> >
> > I'm looking for some advice on how to most safely increase the pgs in our
> > primary ceph pool.
> >
> > A bit of background: We're running ceph 0.80.9 and have a cluster of 126
> > OSDs with only 64 pgs allocated to the pool. As a result, 2 OSDs are now
> > 88% full, while the pool is only showing as 6% used.
> >
> > Based on my understanding, this is clearly a placement problem, so the
> > plan is to increase to 2048 pgs. In order to avoid significant performance
> > degradation, we'll be incrementing pg_num and pgp_num one power of two at a
> > time and waiting for the cluster to rebalance before making the next
> > increment.
> >
> > My question is: are there any other steps we can take to minimize
> > potential performance impact? And/or is there a way to model or predict the
> > level of impact, based on cluster configuration, data placement, etc?
> >
> > Thanks in advance for any answers,
> > Robin
> >


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
