Hi again,

Oops, I'd missed the part about some PGs being degraded, which
prevents the balancer from continuing.

So I assume that you have PGs which are simultaneously
undersized+backfill_toofull? That case does indeed sound tricky. To
solve it you would need either to move PGs out of the toofull OSDs, to
make room for the undersized PGs, or to upmap those undersized PGs to
some other, less-full OSDs.

For the former, you could use the rm-upmaps-underfull script and hope
that it incidentally moves data out of those toofull OSDs. Or a
similar script with some variables reversed could be used to remove
any upmaps which are directing PGs *to* those toofull OSDs. Or maybe
it will be enough to just reweight those OSDs to 0.9.

-- Dan

On Fri, Apr 2, 2021 at 10:47 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi Andras,
>
> Assuming that you've already tightened
> mgr/balancer/upmap_max_deviation to 1, I suspect that this cluster
> already has too many upmaps.
>
> Last time I checked, the balancer implementation is not able to
> improve a pg-upmap-items entry if one already exists for a PG. (It
> can add an OSD mapping pair to a PG, but not change an existing pair
> from one OSD to another.) So I think that what happens in this case
> is the balancer gets stuck in a sort of local minimum in the overall
> optimization.
>
> It can therefore help to simply remove some upmaps, and then wait for
> the balancer to do a better job when it re-creates new entries for
> those PGs. And there's usually some low-hanging fruit -- you can
> start by removing pg-upmap-items entries which are mapping PGs away
> from the least-full OSDs. (Those upmap entries are making the
> least-full OSDs even *less* full.)
>
> We have a script for that:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/rm-upmaps-underfull.py
> It's pretty hacky and I don't use it often, so please use it with
> caution -- you can run it and review which upmaps it would remove.
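[For reference, the "reversed" script idea above could look roughly like
the sketch below: find pg-upmap-items entries that direct PGs *to* a set
of toofull OSDs and print the commands that would remove them. The
`pg_upmap_items` field layout is an assumption based on
`ceph osd dump --format json`; osd.12 and the sample osdmap are made up
for illustration -- review the output before running anything.]

```python
# Sketch only: list pg-upmap-items entries whose destination is one of
# the given (e.g. backfill_toofull) OSDs. The JSON layout here is
# assumed to match `ceph osd dump --format json`:
#   "pg_upmap_items": [{"pgid": "1.2a",
#                       "mappings": [{"from": 7, "to": 12}]}, ...]

def upmaps_into_osds(osd_dump, toofull_osds):
    """Return pgids whose upmap entry maps data onto a toofull OSD."""
    return [item["pgid"]
            for item in osd_dump.get("pg_upmap_items", [])
            if any(m["to"] in toofull_osds for m in item["mappings"])]

# Tiny made-up osdmap standing in for the real `ceph osd dump` output;
# here osd.12 plays the role of the hypothetical toofull OSD.
dump = {"pg_upmap_items": [
    {"pgid": "1.2a", "mappings": [{"from": 7, "to": 12}]},
    {"pgid": "1.3b", "mappings": [{"from": 5, "to": 9}]},
]}

for pgid in upmaps_into_osds(dump, {12}):
    # Review before running: removing the entry lets CRUSH place the
    # PG again without the forced mapping onto the toofull OSD.
    print(f"ceph osd rm-pg-upmap-items {pgid}")
```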
>
> Hope this helps,
>
> Dan
>
>
> On Fri, Apr 2, 2021 at 10:18 AM Andras Pataki
> <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Dear ceph users,
> >
> > On one of our clusters I have some difficulties with the upmap
> > balancer. We started with a reasonably well-balanced cluster (using
> > the balancer in upmap mode). After a node failure, we crush
> > reweighted all the OSDs of the node to take it out of the cluster,
> > and waited for the cluster to rebalance. Obviously, this
> > significantly changes the crush map, so the nice balance created by
> > the balancer was gone. The recovery mostly completed, but some of
> > the OSDs became too full, so we ended up with a few PGs that were
> > backfill_toofull. The cluster has plenty of space (overall perhaps
> > 65% full); only a few OSDs are >90% (we have backfillfull_ratio at
> > 92%). The balancer refuses to change anything since the cluster is
> > not clean. Yet the cluster can't become clean without a few upmaps
> > to help the top 3 or 4 most-full OSDs.
> >
> > I would think this is a fairly common situation -- trying to
> > recover after some failure. Are there any recommendations on how to
> > proceed? Obviously I can manually find and insert upmaps, but for a
> > large cluster with tens of thousands of PGs that isn't too
> > practical. Is there a way to tell the balancer to still do
> > something even though some PGs are undersized? (With a quick look
> > at the python module, I didn't see one.)
> >
> > The cluster is on Nautilus 14.2.15.
> >
> > Thanks,
> >
> > Andras
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
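[For the manual approach Andras asks about, only the top few most-full
OSDs need help, so a short helper can propose the upmaps rather than
finding them by hand. The sketch below is illustrative only: the
utilisation numbers and acting sets are made up (on a real cluster they
would come from `ceph osd df --format json` and `ceph pg dump`), and the
proposed target OSD must still be valid under the CRUSH rule -- the
monitors may refuse or clean up an upmap that puts two replicas in the
same failure domain, so review every command before running it.]

```python
# Sketch: for each of the most-full OSDs above a threshold, pick one PG
# acting on it and propose remapping that PG to the least-full OSD via
# `ceph osd pg-upmap-items <pgid> <from> <to>`.

def propose_upmaps(utilization, pg_acting, full_ratio=0.90, count=4):
    """utilization: {osd_id: fraction_used};
    pg_acting: {pgid: [acting osd_ids]}.
    Returns a list of proposed pg-upmap-items commands."""
    least_full = min(utilization, key=utilization.get)
    full = sorted((o for o, u in utilization.items() if u >= full_ratio),
                  key=utilization.get, reverse=True)[:count]
    cmds = []
    for osd in full:
        for pgid, acting in pg_acting.items():
            # Skip PGs already touching the target OSD.
            if osd in acting and least_full not in acting:
                cmds.append(
                    f"ceph osd pg-upmap-items {pgid} {osd} {least_full}")
                break
    return cmds

util = {0: 0.91, 1: 0.62, 2: 0.55}      # made-up utilisations
pgs = {"2.1f": [0, 1], "2.2e": [1, 2]}  # made-up acting sets
for cmd in propose_upmaps(util, pgs):
    print(cmd)  # review each proposal against the CRUSH rule first
```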