Re: Ceph recovery network speed

Hi all, just an update. We got our octopus test cluster up and found that setting

    ceph osd pool set POOL_NAME pg_autoscale_mode off

on each pool restores the old (pre-autoscaler) behaviour. Setting pg[p]_num on a pool creates and activates all PGs immediately, without the auto-scaler being involved. We will set the default

    ceph config set global osd_pool_default_pg_autoscale_mode off

to disable the auto-scaler for good.
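
For completeness, a minimal sketch of what a PG increase then looks like with the auto-scaler off (POOL_NAME and 2048 are placeholders, adjust to your pool):

    ceph osd pool set POOL_NAME pg_num 2048
    ceph osd pool set POOL_NAME pgp_num 2048
    ceph osd pool get POOL_NAME pg_num

The last command is just a sanity check that pg_num really is at the requested value right away and not slowly walking towards a target.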

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 07 July 2022 09:32:19
To: Stefan Kooman; Curt
Cc: Robert Gallop; ceph-users@xxxxxxx
Subject:  Re: Ceph recovery network speed

Dear Stefan,

> If you decrease pgp_num with for example 1 PG, have it remap /
> rebalance, and after that increase pg_num ... and pgp_num to the same
> number, it will be set instantly.

The code fragment you sent indicates that only pg[p]_num_target is updated, not pg[p]_num itself. Your trick with reducing pgp_num will therefore probably not work: it just adjusts the targets, and the MGR then does some magic.
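
For anyone who wants to see this directly: the targets are part of the OSD map, so comparing them with the actual values should be possible with something along these lines (assuming jq is available and the JSON field names match the code fragment, i.e. pg_num_target/pgp_num_target):

    ceph osd dump --format json \
        | jq '.pools[] | {pool_name, pg_num, pg_num_target, pgp_num, pgp_num_target}'

If pg[p]_num lags behind pg[p]_num_target, it is the MGR slowly walking the values towards the targets.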

This is also what Curt's output indicates: he set pg_num to 2048 and it is not at that value. This is not just wrong in how it is implemented, it is also counter-intuitive. If someone wants to let the MGR adjust things, they should have commands available to set the targets instead of the values. This is highly illogical.

> We use CERNs "upmap-remapped.py" to quickly have Ceph in HEALTH_OK, and
> have the balancer remove upmaps to move data slowly. [...] .. have it as default in the
> balancer code (besides some other functionality that would be nice to have).

I would say that changing defaults so drastically is never a good idea. In addition, such functionality should be an (opt-in) alternative, not a replacement of the existing behaviour. For me, there is no real need for special magic on increases of pg[p]_num to keep PGs from entering backfill-wait: if I see how many PGs are in backfill-wait, I have a good idea how far along the cluster is in the process. The mimic functionality is perfectly fine for me.

> > Is there really no way to do this in an atomic operation any more? Would target_max_misplaced_ratio=100% do the trick?
> I have tried this on a few test clusters ... and it indeed seems to do the trick.

Super, good to know. Thanks for trying this out. I personally think that "old" behaviour should always remain available as a special case of new behaviour. In fact, it should just continue working as before.
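
For the archive, spelled out this would be something like

    ceph config set mgr target_max_misplaced_ratio 1

with 1 meaning 100%. I have not run this on our clusters myself yet, so treat it as a sketch of the idea rather than a tested recipe.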

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx