Re: Unbalanced Cluster

> The balancer was driving all the weights to 1.00000 so I turned it off.  

Which weights (CRUSH or reweight)? And which balancer?

Assuming the ceph-mgr balancer module in upmap mode, you’d want the reweight values to be 1.000, since it uses the newer pg-upmap functionality to distribute capacity.  Lower reweight values tend to confuse the balancer and prevent good uniformity.  If you had a bunch of significantly adjusted reweight values, e.g. from prior runs of reweight-by-utilization, that could contribute to suboptimal balancing.
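
As a minimal sketch (assuming the upmap balancer, and using osd.12 purely as a placeholder ID), checking and resetting override reweights and confirming the balancer mode might look like:

ceph osd df tree                 # REWEIGHT column shows the override reweights
ceph osd reweight 12 1.0         # reset one OSD's override reweight to 1.0
ceph balancer status
ceph balancer mode upmap
ceph balancer on

Keep in mind that resetting reweights back to 1.0 will itself trigger data movement.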

> You mentioned that all solutions would cause data migration and would need to be planned carefully.  I've seen that language in the docs and other messages but what I can't find is what is meant by "planned carefully".

There are many ways to proceed; documenting them all might be a bit of a rabbit-hole.



> Doing any of these will cause data migration like crazy but it's not avoidable other than to change the number of max backfills etc. but the systems should still be accessible during this time but with reduced bandwidth and higher latency.  Is it just a warning that the system could be degraded for a long period of time or is it suggesting that users should take an outage while the rebuild happens?

Throttling recovery/backfill can reduce the impact of big data migrations, at the expense of a longer elapsed time to completion.

osd_max_backfills=1
osd_recovery_max_active=1
osd_recovery_op_priority=1
osd_recovery_max_single_start=1
osd_scrub_during_recovery=false
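
As a sketch (assuming Nautilus's centralized config; adjust to taste), these can be applied cluster-wide at runtime and reverted the same way once the migration settles:

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1
ceph config set osd osd_recovery_max_single_start 1
ceph config set osd osd_scrub_during_recovery false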

Also, ensure that

osd_op_queue_cut_off = high

This helps ensure that recovery / backfill doesn’t DoS client traffic.  I’m not sure whether it is the default in your release.  If it does need changing, I believe the OSDs have to be restarted for the new value to take effect.
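
To check and, if needed, change it (again just a sketch; osd.0 is only an example daemon to spot-check):

ceph config show osd.0 | grep osd_op_queue_cut_off
ceph config set osd osd_op_queue_cut_off high

followed by a rolling restart, e.g. systemctl restart ceph-osd.target on one host at a time, waiting for the cluster to return to HEALTH_OK in between.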

PGs:

pg_num = ( #OSDs * ratio ) / replication
ratio  = ( pg_num * replication ) / #OSDs

where "ratio" is the target number of PGs per OSD and "replication" is the pool's size (the replica count, or k+m for an EC pool).
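
As a worked example with the numbers from this thread (144 OSDs, the EC 7+2 pool so replication = 9, and a target of roughly 100 PGs per OSD):

pg_num = ( 144 * 100 ) / 9 ≈ 1600, rounded to a power of two -> 1024 or 2048

which lines up with the earlier suggestion of 2048 for pool 6.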

On clusters with multiple pools this can get a bit complicated when more than one pool has a significant number of PGs; the end goal is the total number of PGs on a given OSD, which `ceph osd df` reports in its PGS column.

Your OSDs look to have ~190 PGs each on average, which is probably OK given your media.  If you do have big, empty pools, deleting them would give more indicative numbers.  PG ratio targets are somewhat controversial, but depending on your media and RAM an aggregate around this range is reasonable; you can go higher with flash.

This calculator can help when you have multiple pools:

https://old.ceph.com/pgcalc/

If you need to bump pg_num for a pool, you don’t have to do it in one step.  You can increase it by, say, 32 at a time.
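
For instance (a sketch using your pool 6's name; verify the target before running), stepping up from 512 might look like:

ceph osd pool set fsdatak7m2 pg_num 544
ceph osd pool set fsdatak7m2 pgp_num 544
# wait for backfill to settle, then repeat with 576, 608, ...

On Nautilus the mgr will, I believe, also walk pgp_num toward the target gradually on its own, but stepping manually gives you explicit control over when each wave of data movement starts.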


> 
> Thanks for your guidance.
> 
> -Dave
> 
> 
> On 2022-05-05 2:33 a.m., Erdem Agaoglu wrote:
> 
> 
> Hi David,
> 
> I think you're right with your option 2. 512 pgs is just too few. You're also
> right with the "inflation" but you should add your erasure bits to the
> calculation, so 9x512=4608. With 144 OSDs, you would average 32 pgs per OSD.
> Some old advice for that number was around 100.
> 
> But your current PGs per OSD is around 180-190 according to the df output. This
> is probably because of your empty pool 4 fsdata, having 4096 pgs with size 5,
> and adding 5x4096=20480, 20480/144=142 more pgs per OSD.
> 
> I'm not really sure how empty/unused PGs would affect OSD, but I think it will
> affect the balancer which tries to balance the number of PGs, which might
> explain things getting worse. Also your df output shows several modifications in
> weights/reweights but I'm not sure if they're manual or balancer adjusted.
> 
> I would first delete that empty pool to have a more clear picture of PGs on
> OSDs. Then I would increase the pg_num for pool 6 to 2048. And after everything
> settles, if it's still too unbalanced I'd go for the upmap balancer. Needless to
> say, all these would cause major data migration so it should be planned
> carefully.
> 
> Best,
> 
> 
> 
> On Thu, May 5, 2022 at 12:02 AM David Schulz <dschulz@xxxxxxxxxxx> wrote:
> Hi Josh,
> 
> We do have an old pool that is empty so there's 4611 empty PGs but the
> rest seem fairly close:
> 
> # ceph pg ls|awk '{print $7/1024/1024/10}'|cut -d "." -f 1|sed -e
> 's/$/0/'|sort -n|uniq -c
>    4611 00
>       1 1170
>       8 1180
>      10 1190
>      28 1200
>      51 1210
>      54 1220
>      52 1230
>      32 1240
>      13 1250
>       7 1260
> Hmm, that's interesting, adding up the first column except the 4611
> gives 256 but there are 512 PGs in the main data pool.
> 
> Here are our pool settings:
> 
> pool 3 'fsmeta' replicated size 3 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 35490
> flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
> recovery_priority 5 application cephfs
> pool 4 'fsdata' erasure size 5 min_size 4 crush_rule 1 object_hash
> rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 35490
> lfor 0/0/4742 flags hashpspool,ec_overwrites stripe_width 12288
> application cephfs
> pool 6 'fsdatak7m2' erasure size 9 min_size 8 crush_rule 3 object_hash
> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 35490
> flags hashpspool,ec_overwrites stripe_width 28672 application cephfs
> 
> The fsdata was originally created with very safe erasure coding that
> wasted too much space, then the fsdatak7m2 was created and everything
> was migrated to it.  This is why there's at least 4096 pgs with 0 bytes.
> 
> -Dave
> 
> On 2022-05-04 2:08 p.m., Josh Baergen wrote:
>> 
>> 
>> 
>> Hi Dave,
>> 
>>> This cluster was upgraded from 13.x to 14.2.9 some time ago.  The entire
>>> cluster was installed at the 13.x time and was upgraded together so all
>>> OSDs should have the same formatting etc.
>> OK, thanks, that should rule out a difference in bluestore
>> min_alloc_size, for example.
>> 
>>> Below is pasted the ceph osd df tree output.
>> It looks like there is some pretty significant skew in terms of the
>> amount of bytes per active PG. If you issue "ceph pg ls", are you able
>> to find any PGs with a significantly higher byte count?
>> 
>> Josh
> 
> 
> --
> erdem agaoglu

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



