Re: Many misplaced PG's, full OSD's and a good amount of manual intervention to keep my Ceph cluster alive.

Are you using mclock?
I'm not sure. I'll read about it.
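A quick way to check (command names assuming a Reef-era CLI; osd.0 is just an example daemon):
# ceph config get osd osd_op_queue
# ceph config show osd.0 osd_op_queue
If these print mclock_scheduler, mClock is active (it is the default scheduler in Reef).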

What happens if you set it through the Ceph UI?
I changed the global, mon, mgr, osd, mds and client values to 10 (the default is 1):
[screenshot attached: image.png]
...but the output is still:
# ceph-conf --show-config | egrep osd_max_backfills
osd_max_backfills = 1
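If mClock turns out to be active, one thing to be aware of (an aside, not something confirmed in this thread): with the mclock_scheduler, changes to osd_max_backfills are ignored unless the override flag is enabled. A sketch, assuming a Reef-era CLI (osd.0 is just an example daemon):
# ceph config set osd osd_mclock_override_recovery_settings true
# ceph config set osd osd_max_backfills 10
# ceph config show osd.0 osd_max_backfills
ceph config show queries the value a running daemon is actually using, which is more telling here than ceph-conf.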

You could try turning it off and do the balancing yourself (this might be helpful: https://github.com/laimis9133/plankton-swarm).
I'll definitely look into that. Thanks a bunch! 
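For reference, a minimal sketch of that approach (the 120 threshold is just the default oload value, not a recommendation):
# ceph balancer off
# ceph osd test-reweight-by-utilization 120
# ceph osd reweight-by-utilization 120
The test- variant is a dry run that only reports which OSDs would be reweighted; the second command applies the change to OSDs above 120% of mean utilization.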

On Sat, 4 Jan 2025 at 17:06, Laimis Juzeliūnas <laimis.juzeliunas@xxxxxxxxxx> wrote:
Hello Bruno,

Interesting case - a few observations.

What’s the average size of your PGs? 
Judging from the ceph status you have 1394 PGs in total and 696 TiB of used storage - that's roughly 500 GB per PG, if I'm not mistaken.
With the backfill limits this results in a lot of time spent on each PG due to its size. You could try increasing the PG count on those pools to get lighter placement groups.
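For example, something like this (a sketch only - the pool name is taken from the ceph df output further down, and 2048 is a placeholder target, not a sizing recommendation):
# ceph osd pool autoscale-status
# ceph osd pool set default.rgw.buckets.data pg_num 2048
Checking autoscale-status first is worthwhile, since an enabled pg_autoscaler may otherwise adjust the number back.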

Are you using mclock? If yes, you can try setting the profile to prioritise recovery operations with 'ceph config set osd osd_mclock_profile high_recovery_ops'
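To check the current profile before changing it, and to undo the change later, something along these lines should work:
# ceph config get osd osd_mclock_profile
# ceph config rm osd osd_mclock_profile
The second command removes the override so the OSDs fall back to the default profile.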

The max backfills configuration is an interesting one - it should persist. 
What happens if you set it through the Ceph UI?

In general it looks like the balancer might be “fighting” with the manual OSD balancing.
You could try turning it off and do the balancing yourself (this might be helpful: https://github.com/laimis9133/plankton-swarm).

Probably known already, but keep in mind that erasure-coded pools tend to be on the slower side for any data movement because of the additional operations involved.


Best,
Laimis J.


On 4 Jan 2025, at 13:18, bruno.pessanha@xxxxxxxxx wrote:

Hi everyone. I'm still learning how to run Ceph properly in production. I have a cluster (Reef 18.2.4) with 10 nodes (8 x 15 TB NVMe drives each). There are 2 prod pools: one for RGW (3x replica) and one for CephFS (EC k=8, m=2). It was all fine, but once users started storing more data I started seeing:
1. A very high number of misplaced PGs.
2. OSDs very unbalanced and getting 90% full (see the note after the status output below)
```
ceph -s                                                             

 cluster:
   id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
   health: HEALTH_WARN
           Low space hindering backfill (add storage if this doesn't resolve itself): 195 pgs backfill_toofull
           150 pgs not deep-scrubbed in time
           150 pgs not scrubbed in time

 services:
   mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
   mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
   mds: 5/5 daemons up, 6 standby
   osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
   rgw: 30 daemons active (10 hosts, 1 zones)

 data:
   volumes: 1/1 healthy
   pools:   11 pools, 1394 pgs
   objects: 159.65M objects, 279 TiB
   usage:   696 TiB used, 421 TiB / 1.1 PiB avail
   pgs:     230137879/647342099 objects misplaced (35.551%)
            1033 active+clean
            180  active+remapped+backfill_toofull
            123  active+remapped+backfill_wait
            28   active+clean+scrubbing
            15   active+remapped+backfill_wait+backfill_toofull
            10   active+clean+scrubbing+deep
            5    active+remapped+backfilling

 io:
   client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
   recovery: 479 MiB/s, 283 objects/s

 progress:
   Global Recovery Event (5w)
     [=====================.......] (remaining: 11d)
```
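A possible stopgap for the backfill_toofull PGs above (not something suggested in the thread, just a hedged aside): the default backfillfull threshold is 0.90, so OSDs at ~90% stop accepting backfills. If the fullest OSDs are only slightly over it, the threshold can be raised a little, staying well below the 0.95 full ratio:
```
ceph osd set-backfillfull-ratio 0.92
```
This only buys headroom for backfills to proceed; the underlying imbalance still has to be fixed.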

I've been trying to rebalance the OSDs manually since the balancer does not work due to:
```
"optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later",
```
I manually reweighted the 10 most-used OSDs and the number of misplaced objects is going down very slowly; I think it could take many weeks at this rate.
There's almost 40% total free space, but the RGW pool is almost full at ~94%, I think because of the OSD imbalance (see the note after the ceph df output below).
```
ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
ssd    1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
TOTAL  1.1 PiB  421 TiB  697 TiB   697 TiB      62.34

--- POOLS ---
POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                         1     1   69 MiB       15  207 MiB      0     13 TiB
.nfs                         2    32  172 KiB       43  574 KiB      0     13 TiB
.rgw.root                    3    32  2.7 KiB        6   88 KiB      0     13 TiB
default.rgw.log              4    32  2.1 MiB      209  7.0 MiB      0     13 TiB
default.rgw.control          5    32      0 B        8      0 B      0     13 TiB
default.rgw.meta             6    32   97 KiB      280  3.5 MiB      0     13 TiB
default.rgw.buckets.index    7    32   16 GiB    2.41k   47 GiB   0.11     13 TiB
default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69     13 TiB
default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB   0.04     13 TiB
cephfs.cephfs01.data         12   144   83 TiB   23.99M  103 TiB  72.18     32 TiB
cephfs.cephfs01.metadata     13     1  952 MiB  483.14k  3.7 GiB      0     10 TiB
```
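To see how uneven the individual OSDs actually are, a quick check is the per-OSD view (the %USE column for each OSD, plus the MIN/MAX VAR and STDDEV summary at the end of the output):
```
ceph osd df
```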

I also tried changing the following settings, but they do not seem to persist:
```
# ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
osd_max_backfills = 1
osd_recovery_max_active = 0
osd_recovery_max_active_hdd = 3
osd_recovery_max_active_ssd = 10
osd_recovery_op_priority = 3
# ceph config set osd osd_max_backfills 10
# ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
osd_max_backfills = 1
osd_recovery_max_active = 0
osd_recovery_max_active_hdd = 3
osd_recovery_max_active_ssd = 10
osd_recovery_op_priority = 3
```
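A likely explanation for the apparent non-persistence (an assumption, not something confirmed in this thread): ceph-conf --show-config only reads compiled-in defaults plus the local ceph.conf; it does not query the monitors' central configuration database, which is where ceph config set stores values. Commands that should reflect the stored and in-use values (osd.0 is just an example daemon):
```
ceph config get osd osd_max_backfills      # value stored in the mon config database
ceph config show osd.0 osd_max_backfills   # value a running OSD is actually using
```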

1. Why did I end up with so many misplaced PGs when there were no changes to the cluster (number of OSDs, hosts, etc.)?
2. Is it OK to raise target_max_misplaced_ratio above 0.05 so the balancer would work and I wouldn't have to constantly rebalance the OSDs manually? (See the sketch after this list.)
3. Is there a way to speed up the rebalance?
4. Any other recommendations that could help make my cluster healthy again?
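Regarding question 2: target_max_misplaced_ratio is a mgr-level option, so raising it would look roughly like this (0.10 is only a placeholder value, not a recommendation):
```
ceph config set mgr target_max_misplaced_ratio 0.10
```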

Thank you!

Bruno



--
Bruno Gomes Pessanha
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
