Hi,
your cephfs.cephfs01.data pool currently has 144 PGs, so it seems to be
in the middle of a resize, presumably from 128 to 256 PGs. Are you using
the autoscaler, or did you trigger a manual PG increase on the pool?
You can check this with the output of "ceph osd pool ls detail". It
shows the current and target number of PGs and PGPs for all pools.
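For example, something like this should show whether the pool is still on
its way to a new target (the grep just narrows the output down to the data
pool in question):
```
# shows pg_num/pgp_num and, while a split is in progress, the
# pg_num_target/pgp_num_target the pool is heading for
ceph osd pool ls detail | grep cephfs.cephfs01.data
```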
Nonetheless, changing the number of PGs in a pool always results in data
movement, which temporarily consumes additional space. You can ignore the
not-scrubbed-in-time warnings for the moment; the PGs will be scrubbed
again once the pool resize has finished. You should keep an eye on the
free space of the OSDs (e.g. "ceph osd df tree"). Some OSDs are already
above certain thresholds, and these block further backfilling. Backfilling
is still running for 5 PGs, but at that rate it might take a long time.
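For a quick check of how much headroom the fullest OSDs have, something
like this should be enough:
```
# per-OSD utilisation (%USE, VAR) grouped by host, with MIN/MAX VAR and
# STDDEV summarised at the bottom
ceph osd df tree

# OSDs that have already tripped the nearfull/backfillfull thresholds are
# listed in the health details
ceph health detail
```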
You have two options: be patient and let the cluster settle by itself,
or try to speed up the backfilling. I recommend waiting and monitoring
the cluster (especially the OSD free space). If you like to live
dangerously and also have a good, reliable backup of all your data, you
can try two approaches to speed up backfilling:
1. What Laimis wrote in the other mails. It essentially boils down to
giving backfilling a higher priority than client I/O. One addition: the
osd_max_backfills setting is ignored if the cluster is using the mclock
scheduler (afaik the default in Ceph Reef). It might be worth switching
to the wpq scheduler to get better control over backfilling; the mclock
scheduler is too complicated when it comes to this matter.
2. The first approach might speed up the backfill I/O, but it does not
solve the problem of OSDs blocking backfilling due to low space. This is
controlled by the backfillfull ratio (mon_osd_backfillfull_ratio), which
defaults to 90%. In your case this means that about 1.5 TB per affected
OSD is still free, but backfilling is blocked. Raising the threshold
might allow more OSDs to start backfilling, which would speed up the
overall process. Given the size of your PGs (500 GB according to Laimis),
this can be a dangerous operation. I'm not sure whether the running
backfill of an individual PG is interrupted once the threshold is
reached, or whether it only controls the start of _new_ backfills. So if
you want to change the threshold, I would propose the following steps
(example commands after the list):
- set osd_max_backfills to 1
- change to wpq scheduler for all OSDs (requires OSD restart)
- check the state of the cluster, free space etc.
- increase mon_osd_backfillfull_ratio in a very small step (e.g. 90% -> 91%)
- check the cluster state, more PGs should be backfilling now
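A minimal sketch of how those steps could look on the command line
(untested against your setup; note that, as far as I know, on a running
cluster the backfillfull threshold lives in the OSD map, so it is changed
with "ceph osd set-backfillfull-ratio" rather than by setting
mon_osd_backfillfull_ratio):
```
# 1. throttle backfill to one PG per OSD
ceph config set osd osd_max_backfills 1

# 2. switch from mclock to wpq; only takes effect after restarting the OSDs
#    (e.g. "ceph orch daemon restart osd.<id>", host by host)
ceph config get osd osd_op_queue
ceph config set osd osd_op_queue wpq

# 3. check cluster state and free space
ceph -s
ceph osd df tree

# 4. current ratios in the OSD map, then raise backfillfull by one point
ceph osd dump | grep -i ratio
ceph osd set-backfillfull-ratio 0.91

# 5. more PGs should now leave the backfill_toofull state
ceph -s
```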
Before you start this, you might want to check how many OSDs are already
over the threshold and whether backfilling is moving data to or from
them (check "ceph pg dump"; it lists which PG is currently backfilling
from which set of OSDs to which other set).
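Something along these lines should do (the pgs_brief output lists the UP
set, i.e. where the data is going, and the ACTING set, i.e. where it
currently is):
```
# state plus UP ("to") and ACTING ("from") OSD sets for the stuck PGs
ceph pg dump pgs_brief | grep backfill_toofull | head

# full detail for a single PG, including its recovery/backfill state
ceph pg <pgid> query
```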
I assume that the data pool wants to end up with 256 PGs (Ceph prefers
powers of two), so there will be a lot more data movement. Unless your
users are producing a lot of new data, it should be safe to leave the
cluster in its current state and allow it to settle. On the other hand,
it seems to have been in this state for several weeks now, so a little
push might be necessary.
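If the autoscaler is the one driving this, the following should confirm
the 256 target (NEW PG_NUM column), and whether autoscaling is on for the
pool at all:
```
# PG_NUM vs. NEW PG_NUM per pool, plus the AUTOSCALE mode
ceph osd pool autoscale-status
```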
Best regards,
Burkhard
On 04.01.25 12:18, bruno.pessanha@xxxxxxxxx wrote:
Hi everyone. I'm still learning how to run Ceph properly in production. I have a cluster (Reef 18.2.4) with 10 nodes (8 x 15 TB NVMe's each). There are 2 prod pools, one for RGW (3x replica) and one for CephFS (EC 8k2m). It was all fine, but once users started storing more data I started seeing:
1. A very high number of misplaced PGs.
2. OSDs very unbalanced and getting 90% full
```
ceph -s
  cluster:
    id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
    health: HEALTH_WARN
            Low space hindering backfill (add storage if this doesn't resolve itself): 195 pgs backfill_toofull
            150 pgs not deep-scrubbed in time
            150 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
    mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
    mds: 5/5 daemons up, 6 standby
    osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
    rgw: 30 daemons active (10 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 1394 pgs
    objects: 159.65M objects, 279 TiB
    usage:   696 TiB used, 421 TiB / 1.1 PiB avail
    pgs:     230137879/647342099 objects misplaced (35.551%)
             1033 active+clean
             180  active+remapped+backfill_toofull
             123  active+remapped+backfill_wait
             28   active+clean+scrubbing
             15   active+remapped+backfill_wait+backfill_toofull
             10   active+clean+scrubbing+deep
             5    active+remapped+backfilling

  io:
    client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
    recovery: 479 MiB/s, 283 objects/s

  progress:
    Global Recovery Event (5w)
      [=====================.......] (remaining: 11d)
```
I've been trying to rebalance the OSDs manually since the balancer does not work due to:
```
"optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later",
```
I manually re-weighted the 10 most-used OSDs and the number of misplaced objects is going down very slowly. I think it could take many weeks at that rate.
There's almost 40% total free space, but the RGW pool is almost full at ~94%, I think because of the OSD imbalance.
```
ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
ssd    1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
TOTAL  1.1 PiB  421 TiB  697 TiB   697 TiB      62.34

--- POOLS ---
POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                         1     1   69 MiB       15  207 MiB      0     13 TiB
.nfs                         2    32  172 KiB       43  574 KiB      0     13 TiB
.rgw.root                    3    32  2.7 KiB        6   88 KiB      0     13 TiB
default.rgw.log              4    32  2.1 MiB      209  7.0 MiB      0     13 TiB
default.rgw.control          5    32      0 B        8      0 B      0     13 TiB
default.rgw.meta             6    32   97 KiB      280  3.5 MiB      0     13 TiB
default.rgw.buckets.index    7    32   16 GiB    2.41k   47 GiB   0.11     13 TiB
default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69     13 TiB
default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB   0.04     13 TiB
cephfs.cephfs01.data        12   144   83 TiB   23.99M  103 TiB  72.18     32 TiB
cephfs.cephfs01.metadata    13     1  952 MiB  483.14k  3.7 GiB      0     10 TiB
```
I also tried changing the following but it does not seem to persist:
```
# ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
osd_max_backfills = 1
osd_recovery_max_active = 0
osd_recovery_max_active_hdd = 3
osd_recovery_max_active_ssd = 10
osd_recovery_op_priority = 3
# ceph config set osd osd_max_backfills 10
# ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
osd_max_backfills = 1
osd_recovery_max_active = 0
osd_recovery_max_active_hdd = 3
osd_recovery_max_active_ssd = 10
osd_recovery_op_priority = 3
```
1. Why did I end up with so many misplaced PGs when there were no changes to the cluster (number of OSDs, hosts, etc.)?
2. Is it ok to change target_max_misplaced_ratio to something higher than .05 so the balancer would work and I wouldn't have to constantly rebalance the OSDs manually?
3. Is there a way to speed up the rebalance?
4. Any other recommendation that could help to make my cluster healthy again?
Thank you!
Bruno
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx