Re: Stop Rebalancing

On Wed, Apr 13, 2022 at 7:07 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Wed, Apr 13, 2022 at 10:01 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> >
> > I would set the pg_num, not pgp_num. In older versions of Ceph you could
> > manipulate these two separately, but in Pacific I'm not confident about
> > what setting pgp_num directly will do in this exact scenario.
> >
> > How these two interact depends on whether you're splitting or merging.
> > First, definitions: pg_num is the number of PGs and pgp_num is the number
> > used for placing objects.
> >
> > So if pgp_num < pg_num, then at steady state only pgp_num PGs actually
> > store data, and the other pg_num - pgp_num PGs sit empty.
>
> Wait, what? That's not right! pgp_num is the PG *placement* number; it
> controls how we map PGs to OSDs. But the full PG still exists as its
> own thing on the OSD and has its own data structures and objects. If
> the cluster currently has a reduced pgp_num, it has changed the
> locations of PGs, but it hasn't merged any PGs together. Changing
> pg_num and causing merges will invoke a whole new workload, which can
> be pretty substantial.
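>
> (You can see both values per pool with `ceph osd pool ls detail`, or
> individually with `ceph osd pool get <pool> pg_num` and `ceph osd pool
> get <pool> pgp_num`.)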

Eek, yes, I got this wrong. Somehow I imagined some orthogonal
implementation based on how it appears to work in practice.

In any case, isn't this still the best approach to make all PGs go
active+clean ASAP in this scenario?

1. turn off the autoscaler (for those pools, or fully)
2. for any pool with pg_num_target or pgp_num_target values, get the
current pgp_num X and run `ceph osd pool set <pool> pg_num X`.
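
Roughly, something like this (just a sketch; <pool> is a placeholder):

    # 1. disable the autoscaler for the pool
    ceph osd pool set <pool> pg_autoscale_mode off
    # 2. pin pg_num to the current pgp_num
    X=$(ceph osd pool get <pool> pgp_num | awk '{print $2}')
    ceph osd pool set <pool> pg_num "$X"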

Can someone confirm that or recommend something different?

Cheers, Dan



> -Greg
>
> >
> > To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer
> > PGs, then decreases pg_num as the PGs are emptied, to actually delete
> > the now-empty PGs.
> >
> > Splitting is similar but in reverse: first, Ceph creates new empty PGs by
> > increasing pg_num. Then it gradually increases pgp_num to start sending
> > data to the new PGs.
> >
> > That's the general idea, anyway.
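> >
> > For illustration (a sketch; <pool> is a placeholder): both directions are
> > driven by a single pg_num change, and Ceph walks the intermediate values
> > itself.
> >
> >     ceph osd pool set <pool> pg_num 1024   # e.g. a merge from 2048 to 1024
> >     ceph osd pool ls detail                # watch pg_num/pgp_num approach their targets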
> >
> > Long story short, set pg_num to something close to the current
> > pgp_num.
> >
> > .. Dan
> >
> >
> > On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, <ray.cunningham@xxxxxxxxxxxxxx>
> > wrote:
> >
> > > Thank you so much, Dan!
> > >
> > > Can you confirm for me, for pool7 (which has 2048/2048 for pg_num and
> > > 883/2048 for pgp_num), whether we should change pg_num or pgp_num? And can
> > > they be different for a single pool, or do pg_num and pgp_num always have
> > > to be the same?
> > >
> > > If we just set pgp_num to 890, we will have pg_num at 2048 and pgp_num at
> > > 890 -- is that OK? Because if we reduce pg_num by 1200, it will just start
> > > a whole new load of misplaced-object rebalancing. Won't it?
> > >
> > > Thank you,
> > > Ray
> > >
> > >
> > > -----Original Message-----
> > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > Sent: Wednesday, April 13, 2022 11:11 AM
> > > To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> > > Cc: ceph-users@xxxxxxx
> > > Subject: Re:  Stop Rebalancing
> > >
> > > Hi, Thanks.
> > >
> > > norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> > > the best option now to get the PGs to go active+clean and let the mon db
> > > come back under control. Unset those before continuing.
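> > > (That is, `ceph osd unset norebalance` and `ceph osd unset nobackfill`.)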
> > >
> > > I think you need to set the pg_num for pool1 to something close to but
> > > less than 926 (or whatever the pgp_num is when you run the command
> > > below).
> > > The idea is to let a few more merges complete successfully, and then,
> > > once all PGs are active+clean, to decide on the other interventions you
> > > want to carry out.
> > > So this ought to be good:
> > >     ceph osd pool set pool1 pg_num 920
> > >
> > > Then for pool7, it looks like splitting is ongoing. You should be able to
> > > pause that by setting pg_num to something just above 883.
> > > I would do:
> > >     ceph osd pool set pool7 pg_num 890
> > >
> > > It may even be fastest to just set those pg_num values to exactly what
> > > the current pgp_num is. You can try it; see the example below.
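> > > For example, with the values you posted (but double-check pgp_num when
> > > you run it, since it changes as the operation progresses):
> > >     ceph osd pool set pool1 pg_num 926
> > >     ceph osd pool set pool7 pg_num 883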
> > >
> > > Once your cluster is stable again, you should set those to the nearest
> > > power of two.
> > > Personally I would wait for #53729 to be fixed before embarking on future
> > > pg_num changes.
> > > (You'll have to mute a warning in the meantime -- check the docs after the
> > > warning appears).
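> > > (That will likely be something like `ceph health mute
> > > POOL_PG_NUM_NOT_POWER_OF_TWO`, assuming that's the warning that appears.)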
> > >
> > > Cheers, dan
> > >
> > > On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham <
> > > ray.cunningham@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Perfect timing, I was just about to reply. We have disabled the
> > > autoscaler on all pools now.
> > > >
> > > > Unfortunately, I can't just copy and paste from this system...
> > > >
> > > > `ceph osd pool ls detail` -- only 2 pools show any difference:
> > > > pool1: pg_num 940, pg_num_target 256, pgp_num 926, pgp_num_target 256
> > > > pool7: pg_num 2048, pg_num_target 2048, pgp_num 883, pgp_num_target 2048
> > > >
> > > > `ceph osd pool autoscale-status`:
> > > > Size is defined
> > > > Target size is empty
> > > > Rate is 7 for all pools except pool7, which is 1.3333333730697632
> > > > Raw capacity is defined
> > > > Ratio for pool1 is .0177, for pool7 it is .4200, and for all others 0
> > > > Target and effective ratio are empty
> > > > Bias is 1.0 for all
> > > > PG_NUM: pool1 is 256, pool7 is 2048, and all others are 32
> > > > New PG_NUM is empty
> > > > Autoscale is now off for all
> > > > Profile is scale-up
> > > >
> > > >
> > > > We have set norebalance and nobackfill and are watching to see what
> > > happens.
> > > >
> > > > Thank you,
> > > > Ray
> > > >
> > > > -----Original Message-----
> > > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > > Sent: Wednesday, April 13, 2022 10:00 AM
> > > > To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> > > > Cc: ceph-users@xxxxxxx
> > > > Subject: Re:  Stop Rebalancing
> > > >
> > > > One more thing, could you please also share the `ceph osd pool
> > > autoscale-status` ?
> > > >
> > > >
> > > > On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham <
> > > ray.cunningham@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Thank you Dan! I will definitely disable the autoscaler on the rest
> > > of our pools. I can't get the PG numbers today, but I will try to get them
> > > tomorrow. We definitely want to get this under control.
> > > > >
> > > > > Thank you,
> > > > > Ray
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > > > Sent: Tuesday, April 12, 2022 2:46 PM
> > > > > To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> > > > > Cc: ceph-users@xxxxxxx
> > > > > Subject: Re:  Stop Rebalancing
> > > > >
> > > > > Hi Ray,
> > > > >
> > > > > Disabling the autoscaler on all pools is probably a good idea. At
> > > least until https://tracker.ceph.com/issues/53729 is fixed. (You are
> > > likely not susceptible to that -- but better safe than sorry).
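> > > > > (Per pool, that's `ceph osd pool set <pool> pg_autoscale_mode off`.)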
> > > > >
> > > > > To pause the ongoing PG merges, you can indeed set the pg_num to the
> > > current value. This will allow the ongoing merge to complete and prevent
> > > further merges from starting.
> > > > > From `ceph osd pool ls detail` you'll see pg_num, pgp_num,
> > > pg_num_target, and pgp_num_target. If you share the current values of
> > > those, we can advise what to set pg_num to in order to pause things where
> > > they are.
> > > > >
> > > > > BTW -- I'm going to create a request in the tracker that we improve
> > > the pg autoscaler heuristic. IMHO the autoscaler should estimate the time
> > > to carry out a split/merge operation and avoid taking one-way decisions
> > > without permission from the administrator. The autoscaler is meant to be
> > > helpful, not degrade a cluster for 100 days!
> > > > >
> > > > > Cheers, Dan
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham <
> > > ray.cunningham@xxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > We just upgraded our 640 OSD cluster to Ceph 16.2.7, and the
> > > resulting rebalancing of misplaced objects is overwhelming the cluster and
> > > impacting MON DB compaction, deep scrub repairs, and our upgrade of legacy
> > > bluestore OSDs. We have to pause the rebalancing of misplaced objects or
> > > we're going to fall over.
> > > > > >
> > > > > > Autoscaler-status tells us that we are reducing our PGs by 700-ish,
> > > which will take over 100 days to complete at our current recovery speed.
> > > We disabled the autoscaler on our biggest pool, but I'm concerned that it's
> > > already on the path to the lower PG count and won't stop adding to our
> > > misplaced count after we drop below 5%. What can we do to stop the cluster
> > > from finding more misplaced objects to rebalance? Should we set the pg_num
> > > manually to our current count? Or will that cause even more havoc?
> > > > > >
> > > > > > Any other thoughts or ideas? My goals are to stop the rebalancing
> > > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy
> > > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact
> > > when you aren't 100% active+clean).
> > > > > >
> > > > > > Thank you,
> > > > > > Ray
> > > > > >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


