Re: Many misplaced PG's, full OSD's and a good amount of manual intervention to keep my Ceph cluster alive.

+ceph-users

On Mon, 6 Jan 2025 at 15:55, Bruno Gomes Pessanha <bruno.pessanha@xxxxxxxxx>
wrote:

>> Old-style legacy override reweights don’t mesh well with the balancer.
>>  Best to leave them at 1.00.
>>
> Makes sense. I'll leave it at 1.0 and won't touch it again.
>
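> For reference, resetting a stray override is a one-liner per OSD; a
> minimal sketch (osd.42 is just a placeholder id):
>
> ```
> ceph osd df tree          # the REWEIGHT column shows the override values
> ceph osd reweight 42 1.0  # reset one OSD's override back to 1.0
> ```
>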
>> Please send `ceph osd crush rule dump` and `ceph osd dump | grep pool`.
>
>
>  # ceph osd crush rule dump
> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_rule",
>         "type": 1,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 1,
>         "rule_name": "cephfs.cephfs01.data",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~ssd"
>             },
>             {
>                 "op": "choose_indep",
>                 "num": 0,
>                 "type": "osd"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
>
> # ceph osd dump | grep pool
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17721 flags
> hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> read_balance_score 75.00
> pool 2 '.nfs' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 17721 lfor
> 0/0/110 flags hashpspool stripe_width 0 application nfs read_balance_score
> 7.50
> pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 17721 lfor
> 0/0/125 flags hashpspool stripe_width 0 application rgw read_balance_score
> 4.99
> pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 17721 lfor 0/0/125 flags hashpspool stripe_width 0 application rgw
> read_balance_score 4.99
> pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 17721 lfor 0/0/126 flags hashpspool stripe_width 0 application rgw
> read_balance_score 4.99
> pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 17721 lfor 0/0/128 flags hashpspool stripe_width 0 pg_autoscale_bias 4
> application rgw read_balance_score 7.53
> pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 17721 lfor 0/0/222 flags hashpspool stripe_width 0 pg_autoscale_bias 4
> application rgw read_balance_score 4.97
> pool 10 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on
> last_change 17721 lfor 0/0/1044 flags hashpspool,bulk stripe_width 0
> pg_num_max 2048 application rgw read_balance_score 1.80
> pool 11 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 17721 lfor 0/0/319 flags hashpspool stripe_width 0 application
> rgw read_balance_score 5.01
> pool 12 'cephfs.cephfs01.data' erasure profile 8k2m size 10 min_size 9
> crush_rule 1 object_hash rjenkins pg_num 144 pgp_num 16 pg_num_target 1024
> pgp_num_target 1024 autoscale_mode warn last_change 17721 lfor 0/0/16936
> flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 32768
> pg_num_max 1024 application cephfs
> pool 13 'cephfs.cephfs01.metadata' replicated size 4 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change
> 17721 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
> recovery_priority 5 application cephfs read_balance_score 80.00
>
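> Side note: pool 12 above shows pg_num 144 against pg_num_target 1024, so
> a PG split is still in flight. One way to watch it converge (pool name
> taken from the dump above):
>
> ```
> ceph osd pool get cephfs.cephfs01.data pg_num
> ceph osd pool get cephfs.cephfs01.data pgp_num
> ```
>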
> On Sun, 5 Jan 2025 at 15:12, Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
>>
>> >> What reweights have been set for the top OSDs (ceph osd df tree)?
>> >>
>> > Right now they are all at 1.0. I had to lower them to something close
>> > to 0.2 in order to free up space, but I changed them back to 1.0.
>> > Should I lower them while the backfill is happening?
>>
>> Old-style legacy override reweights don’t mesh well with the balancer.
>>  Best to leave them at 1.00.
>>
>> 0.2 is pretty extreme; back in the day I rarely went below 0.8.
>>
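>> With the overrides back at 1.00, the balancer can do that job instead; a
>> minimal sketch of checking and enabling it (assuming upmap mode is the
>> goal):
>>
>> ```
>> ceph balancer status
>> ceph balancer mode upmap
>> ceph balancer on
>> ```
>>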
>> >> ```
>> >> "optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later"
>> >> ```
>>
>> That should clear.  The balancer doesn’t want to stir up trouble if the
>> cluster already has a bunch of backfill / recovery going on.  Patience!
>>
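>> In the meantime the drain can be watched with, for example:
>>
>> ```
>> ceph -s | grep misplaced   # "x/y objects misplaced (z%)"
>> ceph pg stat               # one-line PG state summary
>> ```
>>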
>> >> default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69  13 TiB
>> >> default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB
>>
>> That’s odd that the data pool is that full but the others aren’t.
>>
>> Please send `ceph osd crush rule dump` and `ceph osd dump | grep pool`.
>>
>>
>> >>
>> >> I also tried changing the following, but it does not seem to persist:
>>
>> Could be an mclock thing.
>>
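>> If it is mclock: recovery and backfill overrides are ignored unless the
>> override flag is set first. A sketch (option names per recent releases;
>> check your version's docs):
>>
>> ```
>> ceph config set osd osd_mclock_override_recovery_settings true
>> ceph config set osd osd_max_backfills 2
>> ```
>>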
>> >> 1. Why did I end up with so many misplaced PGs when there were no
>> >> changes on the cluster: number of OSDs, hosts, etc.?
>>
>> Probably a result of the autoscaler splitting PGs or of some change to
>> CRUSH rules such that some data can’t be placed.
>>
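>> To see what the autoscaler has queued (and any splits still pending):
>>
>> ```
>> ceph osd pool autoscale-status
>> ```
>>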
>> >> 2. Is it OK to change target_max_misplaced_ratio to something higher
>> >> than .05 so the balancer would work and I wouldn't have to constantly
>> >> rebalance the OSDs manually?
>>
>> I wouldn’t; that’s a symptom, not the disease.
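>>
>> For reference, the current throttle lives in the mgr and can be
>> inspected without changing it:
>>
>> ```
>> ceph config get mgr target_max_misplaced_ratio
>> ```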


-- 
Bruno Gomes Pessanha
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



