Hello Kasper,

Please be aware that the current "upmap-remapped" script is flaky. It might just refuse to work, with this message:

Error loading remapped pgs

This has been traced to the fact that "ceph pg ls remapped -f json" sets its stderr to non-blocking mode, and that is the same file descriptor to which jq (which follows in the pipeline) writes. Thus, jq can get -EAGAIN and terminate prematurely. The problem is tracked as https://tracker.ceph.com/issues/67505
Retrying the script might help.

What's worse is that the whole reason for adding jq to the upmap-remapped script is another Ceph bug: it sometimes outputs invalid JSON (containing a literal inf or nan instead of a number), and this became much more common with Reef, as new fields were added that are commonly equal to inf or nan. This is tracked as https://tracker.ceph.com/issues/66215 and has a fix merged in a not-yet-released version.

Maybe you should look into alternative tools, like https://github.com/digitalocean/pgremapper

On Fri, Jan 17, 2025 at 11:43 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> > On Jan 17, 2025, at 6:02 AM, Kasper Rasmussen <kasper_steengaard@xxxxxxxxxxx> wrote:
> >
> > However I'm concerned with the amount of data that needs to be rebalanced, since the cluster holds multiple PB, and I'm looking for review of/input for my plan, as well as words of advice/experience from someone who has been in a similar situation.
>
> Yep, that’s why you want to use upmap-remapped. Otherwise the thundering herd of data shuffling will DoS your client traffic, esp. since you’re using spinners. Count on pretty much all data moving in the process, and the convergence taking …. maybe a week?
>
> > On Pacific: Data is marked as "degraded", and not misplaced as expected. I also see above 2000% degraded data (but that might be another issue)
> >
> > On Quincy: Data is marked as misplaced - which seems correct.
>
> I’m not specifically familiar with such a change, but that could be mainly cosmetic, a function of how the percentage is calculated for objects / PGs that are multiply remapped.
>
> In the depths of time I had clusters that would sometimes show a negative number of RADOS objects to recover, it would bounce above and below zero a few times as it converged to 0.
>
> > Instead balancing has been done by a cron job executing - ceph osd reweight-by-utilization 112 0.05 30
>
> I used a similar strategy with older releases. Note that this will complicate your transition, as those relative weights are a function of the CRUSH topology, so when the topology changes, likely some reweighted OSDs will get much less than their fair share, and some will get much more. How full is your cluster (ceph df)? It might not be a bad idea to incrementally revert those all to 1.00000 if you have the capacity, and disable the cron job.
>
> You’ll also likely want to switch to the balancer module for the upmap-remapped strategy to incrementally move your data around. Did you have it disabled for a specific reason?
>
> Updating to Reef before migrating might be to your advantage so that you can benefit from performance and efficiency improvements since Pacific.

--
Alexander Patrakov
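
P.S. To make the invalid-JSON problem above more concrete, here is a minimal Python sketch (not the actual upmap-remapped code) of the kind of sanitizing that jq is there to perform: it replaces the bare inf/nan tokens described in https://tracker.ceph.com/issues/66215 with null before parsing. The "pg_stats" key is an assumption about the shape of the "ceph pg ls remapped -f json" output; verify it against your Ceph version before relying on anything like this.

    # Rough illustration only, not the upmap-remapped script itself.
    import json
    import re
    import subprocess

    # Grab the raw (possibly invalid) JSON from the ceph CLI.
    raw = subprocess.run(
        ["ceph", "pg", "ls", "remapped", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout

    # Replace bare inf/-inf/nan tokens with null so json.loads accepts the text.
    # (Crude: a string value that contains the bare word "inf" would also match.)
    sanitized = re.sub(r"-?\b(inf|nan)\b", "null", raw)

    data = json.loads(sanitized)
    pgs = data.get("pg_stats", [])  # assumed key; check your version's output
    print(f"{len(pgs)} remapped PGs")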
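
P.P.S. Regarding Anthony's suggestion to incrementally revert the legacy reweights to 1.00000: a rough sketch of one way to do that is below. The 0.05 step size and the "nodes"/"reweight"/"id" fields of "ceph osd df -f json" are assumptions to double-check against your cluster; the idea is simply to nudge each overridden OSD a little closer to 1.0, wait for the resulting data movement to finish, and repeat.

    # Rough sketch only: raise every legacy reweight below 1.0 by a small step.
    import json
    import subprocess

    STEP = 0.05  # assumed increment per run; tune to your comfort level

    osd_df = json.loads(subprocess.run(
        ["ceph", "osd", "df", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout)

    for osd in osd_df["nodes"]:  # "nodes"/"reweight"/"id" are assumed field names
        rw = float(osd["reweight"])
        if rw >= 1.0:
            continue
        new_rw = min(1.0, rw + STEP)
        print(f"osd.{osd['id']}: {rw:.5f} -> {new_rw:.5f}")
        subprocess.run(
            ["ceph", "osd", "reweight", str(osd["id"]), f"{new_rw:.5f}"],
            check=True,
        )

Once the reweights are back at 1.00000 and the cron job is gone, the upmap balancer module ("ceph balancer mode upmap" followed by "ceph balancer on") can take over the incremental data movement, as Anthony suggests.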