I would set pg_num, not pgp_num. In older versions of Ceph you could
manipulate these two separately, but in Pacific I'm not confident about what
setting pgp_num directly will do in this exact scenario.

To understand the difference between the two, it helps to know whether you're
splitting or merging. First, definitions: pg_num is the number of PGs, and
pgp_num is the number of PGs used for placing objects. So if pgp_num <
pg_num, then at steady state only pgp_num PGs actually store data, and the
other pg_num - pgp_num PGs sit empty.

To merge PGs, Ceph first decreases pgp_num to squeeze the objects into fewer
PGs, then decreases pg_num as the PGs are emptied, to actually delete the
now-empty PGs. Splitting is similar but in reverse: first Ceph creates new
empty PGs by increasing pg_num, then it gradually increases pgp_num to start
sending data to the new PGs. That's the general idea, anyway.

Long story short, set pg_num to something close to the current
pgp_num_target.
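For example -- just a sketch, not exact commands to copy: substitute your own
pool name and whatever pgp_num_target you actually see on your cluster (both
"<pool>" and "890" below are placeholders):

    # shows pg_num, pgp_num, pg_num_target and pgp_num_target for each pool
    ceph osd pool ls detail

    # pin pg_num near the current pgp_num_target to pause further
    # splitting/merging for that pool
    ceph osd pool set <pool> pg_num 890

Then wait for all PGs to go active+clean before deciding on the final pg_num
values.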
.. Dan

On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, <ray.cunningham@xxxxxxxxxxxxxx>
wrote:

> Thank you so much, Dan!
>
> Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> different for a single pool, or do pg_num and pgp_num always have to be
> the same?
>
> IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> 890, is that ok? Because if we reduce the pg_num by 1200 it will just
> start a whole new load of misplaced object rebalancing. Won't it?
>
> Thank you,
> Ray
>
> -----Original Message-----
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: Wednesday, April 13, 2022 11:11 AM
> To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re: Stop Rebalancing
>
> Hi, Thanks.
>
> norebalance/nobackfill are useful to pause ongoing backfilling, but they
> aren't the best option now to get the PGs to go active+clean and let the
> mon db come back under control. Unset those before continuing.
>
> I think you need to set the pg_num for pool1 to something close to but
> less than 926 (or whatever the pg_num_target is when you run the command
> below).
> The idea is to let a few more merges complete successfully, but then, once
> all PGs are active+clean, to take a decision about the other interventions
> you want to carry out.
> So this ought to be good:
> ceph osd pool set pool1 pg_num 920
>
> Then for pool7 it looks like splitting is ongoing. You should be able to
> pause that by setting the pg_num to something just above 883.
> I would do:
> ceph osd pool set pool7 pg_num 890
>
> It may even be fastest to just set those pg_num values to exactly what the
> current pgp_num_target is. You can try it.
>
> Once your cluster is stable again, then you should set those to the
> nearest power of two.
> Personally I would wait for #53729 to be fixed before embarking on future
> pg_num changes.
> (You'll have to mute a warning in the meantime -- check the docs after the
> warning appears.)
>
> Cheers, dan
>
> On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham
> <ray.cunningham@xxxxxxxxxxxxxx> wrote:
> >
> > Perfect timing, I was just about to reply. We have disabled the
> > autoscaler on all pools now.
> >
> > Unfortunately, I can't just copy and paste from this system...
> >
> > `ceph osd pool ls detail`: only 2 pools have any difference.
> > pool1: pg_num 940, pg_num_target 256, pgp_num 926, pgp_num_target 256
> > pool7: pg_num 2048, pg_num_target 2048, pgp_num 883, pgp_num_target 2048
> >
> > `ceph osd pool autoscale-status`:
> > Size is defined
> > Target size is empty
> > Rate is 7 for all pools except pool7, which is 1.3333333730697632
> > Raw capacity is defined
> > Ratio for pool1 is .0177, pool7 is .4200 and all others are 0
> > Target and Effective Ratio is empty
> > Bias is 1.0 for all
> > PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32
> > New PG_NUM is empty
> > Autoscale is now off for all
> > Profile is scale-up
> >
> > We have set norebalance and nobackfill and are watching to see what
> > happens.
> >
> > Thank you,
> > Ray
> >
> > -----Original Message-----
> > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > Sent: Wednesday, April 13, 2022 10:00 AM
> > To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Stop Rebalancing
> >
> > One more thing, could you please also share the `ceph osd pool
> > autoscale-status`?
> >
> > On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham
> > <ray.cunningham@xxxxxxxxxxxxxx> wrote:
> > >
> > > Thank you Dan! I will definitely disable the autoscaler on the rest of
> > > our pools. I can't get the PG numbers today, but I will try to get
> > > them tomorrow. We definitely want to get this under control.
> > >
> > > Thank you,
> > > Ray
> > >
> > > -----Original Message-----
> > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > Sent: Tuesday, April 12, 2022 2:46 PM
> > > To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> > > Cc: ceph-users@xxxxxxx
> > > Subject: Re: Stop Rebalancing
> > >
> > > Hi Ray,
> > >
> > > Disabling the autoscaler on all pools is probably a good idea, at
> > > least until https://tracker.ceph.com/issues/53729 is fixed. (You are
> > > likely not susceptible to that -- but better safe than sorry.)
> > >
> > > To pause the ongoing PG merges, you can indeed set the pg_num to the
> > > current value. This will allow the ongoing merge to complete and
> > > prevent further merges from starting.
> > > From `ceph osd pool ls detail` you'll see pg_num, pgp_num,
> > > pg_num_target, pgp_num_target... If you share the current values of
> > > those, we can help advise what you need to set the pg_num to in order
> > > to effectively pause things where they are.
> > >
> > > BTW -- I'm going to create a request in the tracker that we improve
> > > the pg autoscaler heuristic. IMHO the autoscaler should estimate the
> > > time to carry out a split/merge operation and avoid taking one-way
> > > decisions without permission from the administrator. The autoscaler is
> > > meant to be helpful, not degrade a cluster for 100 days!
> > >
> > > Cheers, Dan
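A quick sketch of one way to switch the autoscaler off on every pool at once,
assuming a bash shell (adjust if you only want it off for specific pools):

    for pool in $(ceph osd pool ls); do
        ceph osd pool set "$pool" pg_autoscale_mode off
    done

Afterwards, `ceph osd pool autoscale-status` should report the autoscale mode
as "off" for each pool.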
> > > On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham
> > > <ray.cunningham@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi Everyone,
> > > >
> > > > We just upgraded our 640 OSD cluster to Ceph 16.2.7, and the
> > > > resulting rebalancing of misplaced objects is overwhelming the
> > > > cluster and impacting MON DB compaction, deep scrub repairs and our
> > > > upgrading of legacy bluestore OSDs. We have to pause the rebalancing
> > > > of misplaced objects or we're going to fall over.
> > > >
> > > > Autoscaler-status tells us that we are reducing our PGs by 700-ish,
> > > > which will take us over 100 days to complete at our current recovery
> > > > speed. We disabled the autoscaler on our biggest pool, but I'm
> > > > concerned that it's already on the path to the lower PG count and
> > > > won't stop adding to our misplaced count after we drop below 5%.
> > > > What can we do to stop the cluster from finding more misplaced
> > > > objects to rebalance? Should we set the PG num manually to what our
> > > > current count is? Or will that cause even more havoc?
> > > >
> > > > Any other thoughts or ideas? My goals are to stop the rebalancing
> > > > temporarily so we can deep scrub and repair inconsistencies, upgrade
> > > > legacy bluestore OSDs, and compact our MON DBs (supposedly MON DBs
> > > > don't compact when you aren't 100% active+clean).
> > > >
> > > > Thank you,
> > > > Ray
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx