Re: Many misplaced PG's, full OSD's and a good amount of manual intervention to keep my Ceph cluster alive.

>
> What reweights have been set for the top OSDs (ceph osd df tree)?
>
Right now they are all at 1.0. I had to lower them to something close to
0.2 in order to free up space, but I changed them back to 1.0. Should I
lower them again while the backfill is happening?
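For reference, a rough sketch of the kind of reweighting involved, using the
built-in utilization-based helper (the 110 / 0.05 / 10 values are only
illustrative placeholders, not values anyone in this thread suggested):
```
# Show per-OSD utilisation to spot the fullest OSDs
ceph osd df tree

# Dry run: reweight OSDs above 110% of mean utilisation, changing each
# override weight by at most 0.05 and touching at most 10 OSDs
ceph osd test-reweight-by-utilization 110 0.05 10

# Apply the same change once the dry-run output looks reasonable
ceph osd reweight-by-utilization 110 0.05 10
```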

On Sat, 4 Jan 2025 at 17:18, Laimis Juzeliūnas <laimis.juzeliunas@xxxxxxxxxx>
wrote:

> Sorry for the mail spam, but last question:
> What reweights have been set for the top OSDs (ceph osd df tree)?
> Just a guess but they might have been a bit too aggressive and caused a
> lot of backfilling operations.
>
>
> Best,
> *Laimis J.*
>
> On 4 Jan 2025, at 18:05, Laimis Juzeliūnas <laimis.juzeliunas@xxxxxxxxxx>
> wrote:
>
> Hello Bruno,
>
> Interesting case, few observations.
>
> What’s the average size of your PGs?
> Judging from the ceph status you have 1394 PGs in total and 696 TiB of
> used storage, that’s roughly 500 GiB per PG if I’m not mistaken.
> With the backfill limits this results in a lot of time spent per single
> PG due to its size. You could try increasing pg_num on the pools to
> have lighter placement groups (sketch below).
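>
> As a rough sketch (pool name taken from your ceph df output below; 2048 is
> only an illustrative target, and note that splitting PGs itself causes
> extra data movement):
> ```
> # See what the autoscaler would suggest before changing anything
> ceph osd pool autoscale-status
>
> # Raise the PG count on the big RGW data pool (2048 is illustrative)
> ceph osd pool set default.rgw.buckets.data pg_num 2048
> ```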
>
> Are you using mClock? If so, you can try setting the profile to
> prioritise recovery operations with 'ceph config set osd
> osd_mclock_profile high_recovery_ops'
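>
> Roughly, assuming the default mclock_scheduler op queue in Reef:
> ```
> # Confirm which op queue and profile the OSDs are using
> ceph config get osd osd_op_queue
> ceph config get osd osd_mclock_profile
>
> # Favour recovery/backfill over client IO while the cluster heals
> ceph config set osd osd_mclock_profile high_recovery_ops
> ```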
>
> The max backfills configuration is an interesting one - it should persist.
> What happens if you set it through the Ceph UI?
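>
> One guess on why it might not stick: with the mClock scheduler active the
> OSDs ignore manual osd_max_backfills changes unless the override flag is
> set, e.g.:
> ```
> # Allow manual backfill/recovery limits to override the mClock profile
> ceph config set osd osd_mclock_override_recovery_settings true
>
> # This should then take effect on the running OSDs
> ceph config set osd osd_max_backfills 10
> ```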
>
> In general it looks like the balancer might be “fighting” with the manual
> OSD balancing.
> You could try turning it off and doing the balancing yourself (this might be
> helpful: https://github.com/laimis9133/plankton-swarm).
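>
> For example:
> ```
> # Check what the balancer is currently doing
> ceph balancer status
>
> # Pause it while balancing by hand; turn it back on once things settle
> ceph balancer off
> ```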
>
> Also, probably known already, but keep in mind that erasure-coded pools
> tend to be on the slower side for any data movement due to the
> additional operations needed.
>
>
> Best,
> *Laimis J.*
>
>
> On 4 Jan 2025, at 13:18, bruno.pessanha@xxxxxxxxx wrote:
>
> Hi everyone. I'm still learning how to run Ceph properly in production. I
> have a cluster (Reef 18.2.4) with 10 nodes (8 x 15 TB NVMe OSDs each).
> There are 2 production pools, one for RGW (3x replica) and one for CephFS
> (EC k=8, m=2). It was all fine, but once users started storing more data I
> started seeing:
> 1. A very high number of misplaced PGs.
> 2. OSDs very unbalanced, some getting 90% full.
> ```
> ceph -s
>
>  cluster:
>    id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
>    health: HEALTH_WARN
>            Low space hindering backfill (add storage if this doesn't
> resolve itself): 195 pgs backfill_toofull
>            150 pgs not deep-scrubbed in time
>            150 pgs not scrubbed in time
>
>  services:
>    mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
>    mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
>    mds: 5/5 daemons up, 6 standby
>    osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
>    rgw: 30 daemons active (10 hosts, 1 zones)
>
>  data:
>    volumes: 1/1 healthy
>    pools:   11 pools, 1394 pgs
>    objects: 159.65M objects, 279 TiB
>    usage:   696 TiB used, 421 TiB / 1.1 PiB avail
>    pgs:     230137879/647342099 objects misplaced (35.551%)
>             1033 active+clean
>             180  active+remapped+backfill_toofull
>             123  active+remapped+backfill_wait
>             28   active+clean+scrubbing
>             15   active+remapped+backfill_wait+backfill_toofull
>             10   active+clean+scrubbing+deep
>             5    active+remapped+backfilling
>
>  io:
>    client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
>    recovery: 479 MiB/s, 283 objects/s
>
>  progress:
>    Global Recovery Event (5w)
>      [=====================.......] (remaining: 11d)
> ```
>
> I've been trying to rebalance the OSDs manually since the balancer does
> not work due to:
> ```
> "optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced;
> try again later",
> ```
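>
> That 0.05 threshold is the mgr's target_max_misplaced_ratio; raising it
> would look something like this (0.10 is only an illustrative value):
> ```
> # Let the balancer tolerate a higher misplaced ratio before refusing to act
> ceph config set mgr target_max_misplaced_ratio 0.10
> ```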
> I manually re-weighted the top 10 most used OSDs and the number of
> misplaced objects is going down very slowly. I think it could take many
> weeks at that rate.
> There's almost 40% of total free space, but the RGW pool is almost full
> at ~94%, I think because of the OSD imbalance.
> ```
> ceph df
> --- RAW STORAGE ---
> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
> ssd    1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
> TOTAL  1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
>
> --- POOLS ---
> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> .mgr                         1     1   69 MiB       15  207 MiB      0     13 TiB
> .nfs                         2    32  172 KiB       43  574 KiB      0     13 TiB
> .rgw.root                    3    32  2.7 KiB        6   88 KiB      0     13 TiB
> default.rgw.log              4    32  2.1 MiB      209  7.0 MiB      0     13 TiB
> default.rgw.control          5    32      0 B        8      0 B      0     13 TiB
> default.rgw.meta             6    32   97 KiB      280  3.5 MiB      0     13 TiB
> default.rgw.buckets.index    7    32   16 GiB    2.41k   47 GiB   0.11     13 TiB
> default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69     13 TiB
> default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB   0.04     13 TiB
> cephfs.cephfs01.data        12   144   83 TiB   23.99M  103 TiB  72.18     32 TiB
> cephfs.cephfs01.metadata    13     1  952 MiB  483.14k  3.7 GiB      0     10 TiB
> ```
>
> I also tried changing the following but it does not seem to persist:
> ```
> # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
> osd_max_backfills = 1
> osd_recovery_max_active = 0
> osd_recovery_max_active_hdd = 3
> osd_recovery_max_active_ssd = 10
> osd_recovery_op_priority = 3
> # ceph config set osd osd_max_backfills 10
> # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
> osd_max_backfills = 1
> osd_recovery_max_active = 0
> osd_recovery_max_active_hdd = 3
> osd_recovery_max_active_ssd = 10
> osd_recovery_op_priority = 3
> ```
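>
> (As far as I understand, ceph-conf --show-config only reads the compiled
> defaults plus the local ceph.conf, not the mon config database, so to
> check the value the daemons actually use it's probably something like:)
> ```
> # Value stored in the centralized config database
> ceph config get osd osd_max_backfills
>
> # Effective value on a running daemon (osd.0 as an example)
> ceph config show osd.0 osd_max_backfills
> ```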
>
> 1. Why did I end up with so many misplaced PGs when there were no changes
> to the cluster (number of OSDs, hosts, etc.)?
> 2. Is it OK to change target_max_misplaced_ratio to something higher than
> 0.05 so the balancer would work and I wouldn't have to constantly
> rebalance the OSDs manually?
> 3. Is there a way to speed up the rebalance?
> 4. Any other recommendations that could help make my cluster healthy
> again?
>
> Thank you!
>
> Bruno

-- 
Bruno Gomes Pessanha
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



