Wait, after re-reading my own ticket I realized you can more easily remove
the leftover PGs by re-peering the *other* OSDs:

"I found a way to remove those leftover PGs (without using
ceph-objectstore-tool): If the PG re-peers, then osd.74 notices it's not in
the up/acting set and starts deleting the PG. So at the moment I'm
restarting those former peers to trim this OSD."

-- dan

On Wed, Jul 28, 2021 at 10:37 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Cool, looks like the second problem is the real issue here :)
>
> IIRC, you can remove the leftover PGs with ceph-objectstore-tool. I
> don't recall the exact syntax, but you'd need to find out which PGs
> are not mapped there by the current CRUSH rule and remove the others.
> Or, you can zap and re-create the OSD.
>
> -- dan
>
>
> On Wed, Jul 28, 2021 at 10:34 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >
> > How "wide" is "wide"? I have 4 nodes and 140 HDD OSDs. Here is the info from the Ceph system:
> >
> > # ceph osd erasure-code-profile get hdd_ec
> > crush-device-class=hdd
> > crush-failure-domain=host
> > crush-root=default
> > jerasure-per-chunk-alignment=false
> > k=2
> > m=1
> > plugin=jerasure
> > technique=reed_sol_van
> > w=8
> >
> > Here is what your script gives:
> >
> > # python tools/ceph-pool-pg-distribution 3
> > Searching for PGs in pools: ['3']
> > Summary: 2048 PGs on 140 osds
> >
> > Num OSDs with X PGs:
> > 43: 16
> > 44: 124
> >
> > ... and finally your last proposal: it looks like I have some leftover PGs, see below. I'm also observing PG counts other than 43/44 on other OSDs in the system.
> >
> > # ceph daemon osd.0 status
> > {
> >     "cluster_fsid": "55633ec3-6c0c-4a02-990c-0f87e0f7a01f",
> >     "osd_fsid": "85e266f1-8d8c-4c2a-b03c-0aef9bc4e532",
> >     "whoami": 0,
> >     "state": "active",
> >     "oldest_map": 99775,
> >     "newest_map": 281713,
> >     "num_pgs": 77
> > }
> >
> > I found this ticket (https://tracker.ceph.com/issues/38931 -- I believe you actually opened it ;-)) and tried restarting osd.0, and now the OSD is scrubbing some of its PGs... However, I'm uncertain that this is actually trimming the leftover PGs.
> >
> > Thanks for all your help up to this point already!
> >
> > Best wishes,
> > Manuel
> >
> > On Wed, Jul 28, 2021 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >> How wide is hdd_ec? With a wide EC rule and relatively few OSDs and
> >> relatively few PGs per OSD for the pool, it can be impossible for the
> >> balancer to make things perfect.
> >> It would help to look at the PG distribution for only the hdd_ec pool
> >> -- this script can help:
> >> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-pool-pg-distribution
> >>
> >> Another possibility is that osd.0 has some leftover data from PGs that
> >> should have been deleted. From the box, check `ceph daemon osd.0
> >> status` and compare the number of PGs it holds vs the value in your
> >> osd df output (48).
> >>
> >> -- dan
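A minimal sketch of the leftover-PG check described in the messages above. The commands exist in Nautilus-era Ceph, but the OSD id, paths, and the systemd unit name are assumptions to adapt to your own cluster, and (as the top of the thread concludes) letting the PGs re-peer is the easier cleanup than removing anything by hand:

    # PGs that CRUSH currently maps to osd.0 (should match the 43/44 from the script)
    ceph pg ls-by-osd 0

    # PGs the daemon actually holds (77 in the output above, so roughly 33 leftovers)
    ceph daemon osd.0 status

    # With the OSD stopped (systemd-managed, non-containerized deployment assumed),
    # list what is physically present on disk; anything not in the ls-by-osd set
    # is a leftover PG.
    systemctl stop ceph-osd@0
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs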
> >>
> >> On Wed, Jul 28, 2021 at 9:24 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >> >
> >> > Hi,
> >> >
> >> > thanks for your quick response. I already did this earlier this week:
> >> >
> >> > # ceph config dump | grep upmap_max_deviation
> >> > mgr        advanced  mgr/balancer/upmap_max_deviation  1
> >> >
> >> > Cheers,
> >> > Manuel
> >> >
> >> > On Wed, Jul 28, 2021 at 9:15 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Start by setting:
> >> >>
> >> >>     ceph config set mgr mgr/balancer/upmap_max_deviation 1
> >> >>
> >> >> This configures the balancer to squeeze the OSDs to within 1 PG of each other.
> >> >>
> >> >> I'm starting to think this should be the default.
> >> >>
> >> >> Cheers, dan
> >> >>
> >> >>
> >> >> On Wed, Jul 28, 2021 at 9:08 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >> >> >
> >> >> > Dear all,
> >> >> >
> >> >> > I'm running Ceph 14.2.11. I have 140 HDDs in my cluster of 4 nodes, 35 HDDs
> >> >> > per node. I am observing fill ratios of 66% to 70% on most OSDs and then one
> >> >> > at 82% (see the attached ceph-osd-df.txt for the output of "ceph osd df").
> >> >> >
> >> >> > Previously, I had problems with single OSDs filling up to 85% and then
> >> >> > everything coming to a grinding halt. Ideally, I would like all OSD
> >> >> > fill grades to be close to the mean of 67%... At the very least I need to
> >> >> > get the 82% OSD back into the range.
> >> >> >
> >> >> > I have upmap balancing enabled and the balancer says:
> >> >> >
> >> >> > # ceph balancer status
> >> >> > {
> >> >> >     "last_optimize_duration": "0:00:00.053686",
> >> >> >     "plans": [],
> >> >> >     "mode": "upmap",
> >> >> >     "active": true,
> >> >> >     "optimize_result": "Unable to find further optimization, or pool(s)
> >> >> > pg_num is decreasing, or distribution is already perfect",
> >> >> >     "last_optimize_started": "Wed Jul 28 09:03:02 2021"
> >> >> > }
> >> >> >
> >> >> > Creating an offline balancing plan looks like this:
> >> >> >
> >> >> > # ceph osd getmap -o om
> >> >> > got osdmap epoch 281708
> >> >> > # osdmaptool om --upmap out.txt --upmap-pool hdd_ec --upmap-deviation 1
> >> >> > --upmap-active
> >> >> > osdmaptool: osdmap file 'om'
> >> >> > writing upmap command output to: out.txt
> >> >> > checking for upmap cleanups
> >> >> > upmap, max-count 10, max deviation 1
> >> >> > limiting to pools hdd_ec ([3])
> >> >> > pools hdd_ec
> >> >> > prepared 0/10 changes
> >> >> > Time elapsed 0.0275739 secs
> >> >> > Unable to find further optimization, or distribution is already perfect
> >> >> > osd.0 pgs 43
> >> >> > [...]
> >> >> > # wc -l out.txt
> >> >> > 0 out.txt
> >> >> >
> >> >> > Does anyone have a suggestion on how to get the 82% OSD closer to the
> >> >> > mean fill ratio (and maybe the other OSDs as well)?
> >> >> >
> >> >> > Thanks,
> >> >> > Manuel
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
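The imbalance in this thread came from leftover PGs, not from the balancer itself. A minimal sketch of the follow-up once those PGs have been trimmed, assuming the hdd_ec pool and the Nautilus-era tools shown above; evaluating the pool score and sourcing out.txt are the usual upmap workflow, but verify the exact flags against your release:

    # Check how even the pool distribution really is (lower score = more even).
    ceph balancer eval hdd_ec

    # Re-run the offline plan from the thread; with a deviation of 1 the
    # balancer aims for at most 1 PG difference between OSDs in the pool.
    ceph osd getmap -o om
    osdmaptool om --upmap out.txt --upmap-pool hdd_ec --upmap-deviation 1

    # out.txt contains plain "ceph osd pg-upmap-items ..." commands, so
    # sourcing it applies the proposed upmaps (a no-op when the file is empty).
    source out.txt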