Re: Many misplaced PG's, full OSD's and a good amount of manual intervention to keep my Ceph cluster alive.

Hi,


Your cephfs.cephfs01.data pool currently has 144 PGs. Since 144 is not a power of two, the pool seems to be in the middle of a resize, e.g. from 128 to 256 PGs. Do you use the autoscaler, or did you trigger a manual PG increase for that pool?

You can check this with the output of "ceph osd pool ls detail". It shows the current and target number of PGs and PGPs for all pools.
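For example, something along these lines should show it (field names like pg_num_target may differ slightly between releases, so treat this as a sketch):

```
# Current vs. target PG/PGP counts for the pool; during a resize the line
# usually also carries pg_num_target / pgp_num_target:
ceph osd pool ls detail | grep cephfs.cephfs01.data

# Is the autoscaler enabled for this pool?
ceph osd pool get cephfs.cephfs01.data pg_autoscale_mode

# If the autoscaler module is active, this shows current vs. proposed PG numbers:
ceph osd pool autoscale-status
```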

Nonetheless, changing the number of PGs in a pool will always result in data movement, and this will temporarily use more space. You can ignore the not-scrubbed-in-time warnings for the moment; the PGs will be scrubbed again after the pool resize is finished. You should keep an eye on the free space of the OSDs (e.g. with "ceph osd df tree"). There are already some OSDs above certain thresholds, and these will be blocking further backfilling. Backfilling is still running for 5 PGs, but at that rate it might take a very long time.
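
For monitoring, commands along these lines should do:

```
# Per-OSD utilization, grouped by the CRUSH tree; watch the fullest OSDs:
ceph osd df tree

# The currently active nearfull/backfillfull/full thresholds are stored
# in the OSD map:
ceph osd dump | grep -i ratio
```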

You have two options: be patient and let the cluster settle by itself, or try to speed up the backfilling. I recommend waiting and monitoring the cluster (and especially the OSD free space). If you like to live dangerously and also have a good and reliable backup of all your data, you can try two approaches to speed up backfilling:

1. What Laimis wrote in the other mails. It essentially boils down to giving backfilling a higher priority than client I/O. One addition: the osd_max_backfills setting is ignored if the cluster is using the mclock scheduler (afaik the default in Ceph Reef). It might be worth switching to the wpq scheduler to get better control over backfilling; the mclock scheduler is too complicated when it comes to this matter.

2. The first approach might speed up I/O, but it does not solve the problem of OSDs blocking backfilling due to low free space. This is controlled by the mon_osd_backfillfull_ratio setting, which defaults to 90%. In your case this means that 1.5 TB is still available on such an OSD, but backfilling to it is blocked. Raising the threshold might enable more OSDs to take part in backfilling, which will speed up the overall process. Given the size of your PGs (500 GB according to Laimis), this can be a dangerous operation. I'm not sure whether a running backfill of an individual PG is interrupted once the threshold is reached, or whether it only controls starting _new_ backfills. So if you want to change the threshold, I would propose the following steps (a rough command sketch follows after the list):

  - set osd_max_backfills to 1

  - change to wpq scheduler for all OSDs (requires OSD restart)

  - check the state of the cluster, free space etc.

  - increase mon_osd_backfillfull_ratio in a very small step (e.g. 90% -> 91%)

  - check the cluster state, more PGs should be backfilling now

Before you start this you might want to check how many OSDs are already over the threshold and whether backfilling is moving data to or from them (check "ceph pg dump"; it lists which PG is currently backfilling from which set of OSDs to which other set).
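
A minimal command sketch of those steps might look like this. Treat it as a sketch: the exact way to restart OSDs and to adjust the threshold depends on your deployment and release.

```
# 1. Limit each OSD to one concurrent backfill:
ceph config set osd osd_max_backfills 1

# 2. Switch from mclock to the wpq scheduler (only takes effect after
#    the OSDs have been restarted, e.g. via ceph orch or systemctl):
ceph config set osd osd_op_queue wpq

# 3. Check cluster state and free space:
ceph -s
ceph osd df tree

# 4. Raise the backfillfull threshold in a very small step; on recent
#    releases the effective value lives in the OSD map and is changed
#    like this rather than via the mon option:
ceph osd set-backfillfull-ratio 0.91

# 5. Verify that more PGs start backfilling and see where data is moving:
ceph pg dump pgs_brief | grep backfill
```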

I assume that the data pool wants to end up with 256 PGs (Ceph prefers powers of two), so there will be a lot more data movement. Unless your users are producing a lot of new data, it should be safe to leave the cluster in its current state and let it settle. On the other hand, it seems to have been in this state for several weeks now, so a little push might be necessary.


Best regards,

Burkhard

On 04.01.25 12:18, bruno.pessanha@xxxxxxxxx wrote:
Hi everyone. I'm still learning how to run Ceph properly in production. I have a cluster (Reef 18.2.4) with 10 nodes (8 x 15 TB NVMes each). There are 2 prod pools, one for RGW (3x replica) and one for CephFS (EC, k=8 m=2). It was all fine, but once users started storing more data I started seeing:
1. A very high number of misplaced PGs.
2. OSDs very unbalanced, some getting 90% full.
```
ceph -s

   cluster:
     id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
     health: HEALTH_WARN
             Low space hindering backfill (add storage if this doesn't resolve itself): 195 pgs backfill_toofull
             150 pgs not deep-scrubbed in time
             150 pgs not scrubbed in time

   services:
     mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
     mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
     mds: 5/5 daemons up, 6 standby
     osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
     rgw: 30 daemons active (10 hosts, 1 zones)

   data:
     volumes: 1/1 healthy
     pools:   11 pools, 1394 pgs
     objects: 159.65M objects, 279 TiB
     usage:   696 TiB used, 421 TiB / 1.1 PiB avail
     pgs:     230137879/647342099 objects misplaced (35.551%)
              1033 active+clean
              180  active+remapped+backfill_toofull
              123  active+remapped+backfill_wait
              28   active+clean+scrubbing
              15   active+remapped+backfill_wait+backfill_toofull
              10   active+clean+scrubbing+deep
              5    active+remapped+backfilling

   io:
     client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
     recovery: 479 MiB/s, 283 objects/s

   progress:
     Global Recovery Event (5w)
       [=====================.......] (remaining: 11d)
```

I've been trying to rebalance the OSDs manually since the balancer does not work due to:
```
"optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later",
```
I manually re-weighted the top 10 most used OSDs and the number of misplaced objects is going down very slowly. I think it could take many weeks at that rate.
There's almost 40% of total free space, but the RGW pool is almost full at ~94%, I think because of the OSD imbalance.
```
ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
ssd    1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
TOTAL  1.1 PiB  421 TiB  697 TiB   697 TiB      62.34

--- POOLS ---
POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                         1     1   69 MiB       15  207 MiB      0     13 TiB
.nfs                         2    32  172 KiB       43  574 KiB      0     13 TiB
.rgw.root                    3    32  2.7 KiB        6   88 KiB      0     13 TiB
default.rgw.log              4    32  2.1 MiB      209  7.0 MiB      0     13 TiB
default.rgw.control          5    32      0 B        8      0 B      0     13 TiB
default.rgw.meta             6    32   97 KiB      280  3.5 MiB      0     13 TiB
default.rgw.buckets.index    7    32   16 GiB    2.41k   47 GiB   0.11     13 TiB
default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69     13 TiB
default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB   0.04     13 TiB
cephfs.cephfs01.data         12   144   83 TiB   23.99M  103 TiB  72.18     32 TiB
cephfs.cephfs01.metadata     13     1  952 MiB  483.14k  3.7 GiB      0     10 TiB
```

I also tried changing the following, but it does not seem to persist:
```
# ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
osd_max_backfills = 1
osd_recovery_max_active = 0
osd_recovery_max_active_hdd = 3
osd_recovery_max_active_ssd = 10
osd_recovery_op_priority = 3
# ceph config set osd osd_max_backfills 10
# ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
osd_max_backfills = 1
osd_recovery_max_active = 0
osd_recovery_max_active_hdd = 3
osd_recovery_max_active_ssd = 10
osd_recovery_op_priority = 3
```

1. Why did I end up with so many misplaced PGs, given that there were no changes to the cluster (number of OSDs, hosts, etc.)?
2. Is it OK to change target_max_misplaced_ratio to something higher than .05 so the balancer would work and I wouldn't have to constantly rebalance the OSDs manually?
3. Is there a way to speed up the rebalance?
4. Any other recommendation that could help to make my cluster healthy again?

Thank you!

Bruno
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx