Re: Adding Rack to crushmap - Rebalancing multiple PB of data - advice/experience

On Pacific -
It seems that when data is marked as degraded, no PGs are remapped, and upmap-remapped.py consistently returns "There are no remapped PGs".
Also, nobackfill and norebalance have no effect in holding back any remapping (norecover does).
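For reference, the flags I'm toggling are the standard cluster-wide ones, roughly:

    ceph osd set norecover       # the only flag that actually held back movement for me
    ceph osd set nobackfill
    ceph osd set norebalance
    ceph pg ls remapped          # check whether any PGs show up as remapped at all
    ceph osd unset norecover     # and the matching unset once ready to let it run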

The recovery of the degraded data seems to be what is doing the remapping.

So deploying a new CRUSH map on Pacific seems to be a big-bang operation with no control handles.
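For completeness, the CRUSH change itself is just the usual bucket commands; the rack and host names below are placeholders:

    ceph osd set norecover                    # hold movement back first
    ceph osd crush add-bucket rack1 rack      # create the new rack bucket
    ceph osd crush move rack1 root=default    # attach it under the root
    ceph osd crush move host01 rack=rack1     # move a host underneath it
    ceph osd unset norecover                  # then let recovery/remapping proceed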


Balancing:
My cluster is at 55% RAW used.
The balancer was disabled before I took over the cluster; unfortunately I do not have the full history of that. I believe it had something to do with it not working, or being far too ineffective.
Is your advice to revert the weights to 1.00000 meant to give the balancer a clean starting point, or is there another reason?


My conclusion for now is that, since an upgrade to Quincy or Reef is already in the pipeline for this cluster, I'll do that first before adding racks to my CRUSH map.



________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: Friday, January 17, 2025 16:06
To: Kasper Rasmussen <kasper_steengaard@xxxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re:  Adding Rack to crushmap - Rebalancing multiple PB of data - advice/experience



On Jan 17, 2025, at 6:02 AM, Kasper Rasmussen <kasper_steengaard@xxxxxxxxxxx> wrote:

However I'm concerned with the amount of data that needs to be rebalanced, since the cluster holds multiple PB, and I'm looking for review of/input for my plan, as well as words of advice/experience from someone who has been in a similar situation.

Yep, that’s why you want to use upmap-remapped.  Otherwise the thundering herd of data shuffling will DoS your client traffic, esp. since you’re using spinners.  Count on pretty much all data moving in the process, and the convergence taking …. maybe a week?
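Roughly, the idea with the CERN script is: hold back rebalancing, make the CRUSH change, let upmap-remapped.py emit pg-upmap-items entries that pin the remapped PGs back onto the OSDs where the data currently sits, then let the balancer remove those upmaps gradually. Something like (paths are illustrative):

    ceph balancer off
    ceph osd set norebalance
    # ... apply the CRUSH change here ...
    ./upmap-remapped.py | sh       # pins remapped PGs to their current OSDs
    ceph osd unset norebalance
    # then re-enable the balancer in upmap mode and let it drain the upmaps at its own pace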

On Pacific: Data is marked as "degraded" rather than misplaced, as I would have expected. I also see more than 2000% degraded data (but that might be another issue).

On Quincy: Data is marked as misplaced - which seems correct.

I’m not specifically familiar with such a change, but that could be mainly cosmetic, a function of how the percentage is calculated for objects / PGs that are multiply remapped.

In the depths of time I had clusters that would sometimes show a negative number of RADOS objects to recover; it would bounce above and below zero a few times as it converged to 0.


Instead, balancing has been done by a cron job executing: ceph osd reweight-by-utilization 112 0.05 30

I used a similar strategy with older releases.  Note that this will complicate your transition, as those relative weights are a function of the CRUSH topology, so when the topology changes, likely some reweighted OSDs will get much less than their fair share, and some will get much more.  How full is your cluster (ceph df)?  It might not be a bad idea to incrementally revert those all to 1.00000 if you have the capacity, and disable the cron job.
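Concretely, something along these lines; the OSD ID is a placeholder:

    ceph df                        # overall RAW used and per-pool headroom
    ceph osd df                    # the REWEIGHT column shows which OSDs were overridden
    ceph osd reweight 42 0.95      # nudge an overridden OSD back toward 1.00000 in small steps
    ceph osd reweight 42 1.0       # ...until it is back at the default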
You’ll also likely want to switch to the balancer module for the upmap-remapped strategy to incrementally move your data around.  Did you have it disabled for a specific reason?
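Something along the lines of:

    ceph osd set-require-min-compat-client luminous       # upmap needs luminous or newer clients
    ceph balancer mode upmap
    ceph config set mgr target_max_misplaced_ratio 0.01   # throttle how much is in flight at once
    ceph balancer on
    ceph balancer status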

Updating to Reef before migrating might be to your advantage so that you can benefit from performance and efficiency improvements since Pacific.


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



