> Do you use the autoscaler or did you trigger a manual PG increment of the pool?

The pool had autoscale enabled until 2 days ago, when I thought it was better to change things manually in order to get a more deterministic result. Yes, I wanted to increase it from "1" to something like "1024", but it looks like it gets capped at 144 no matter what I do:

  # ceph osd pool get cephfs.cephfs01.data pg_num
  pg_num: 144
  # ceph osd pool set cephfs.cephfs01.data pg_num 1024
  # ceph osd pool get cephfs.cephfs01.data pg_num
  pg_num: 144

> You can check this with the output of "ceph osd pool ls detail". It shows the current and target number of PGs and PGPs for all pools.

That's a very useful command! Thanks.

  # ceph osd pool ls detail | grep cephfs.cephfs01.data
  pool 12 'cephfs.cephfs01.data' erasure profile 8k2m size 10 min_size 9 crush_rule 1 object_hash rjenkins pg_num 144 pgp_num 16 pg_num_target 1024 pgp_num_target 1024 autoscale_mode warn last_change 17398 lfor 0/0/16936 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 32768 pg_num_max 1024 application cephfs

> You can ignore the not-scrubbed-in-time warnings for the moment, the PG will be scrubbed again after the pool resize is finished.

Will do.

Your suggestion #2 makes total sense to me. I'll fine-tune mon_osd_backfillfull_ratio, leave the backfills running a few more days and keep monitoring. My only concern is users starting to store a lot more data all of a sudden. I'll keep an eye on it.

Is it correct to say that every PG/OSD change can potentially cause data misplacement, unbalanced OSDs and long backfills? I'll be way more careful before tuning these if that's the case.

Thank you both so much! It definitely helped me understand Ceph better. It is kind of a steep learning curve :).

On Sat, 4 Jan 2025 at 19:03, Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> your cephfs.cephfs01.data pool currently has 144 PGs, so this pool seems to be resizing, e.g. from 128 PGs to 256 PGs. Do you use the autoscaler or did you trigger a manual PG increment of the pool?
>
> You can check this with the output of "ceph osd pool ls detail". It shows the current and target number of PGs and PGPs for all pools.
>
> Nonetheless, changing the number of PGs in a pool will always result in data movement, and this will temporarily use more space. You can ignore the not-scrubbed-in-time warnings for the moment; the PGs will be scrubbed again after the pool resize is finished. You should keep an eye on the free space of the OSDs (e.g. ceph osd df tree). There are already some OSDs above certain thresholds, and these will be blocking further backfilling. Backfilling is still running for 5 PGs, but it might take a lot of time at that rate.
>
> You have two options: be patient and let the cluster settle by itself, or try to speed up the backfilling. I recommend waiting and monitoring the cluster (and especially the OSD free space). If you like to live dangerously and also have a good and reliable backup of all your data, you can try two approaches to speed up backfilling:
>
> 1. What Laimis wrote in the other mails. It essentially boils down to giving backfilling a higher priority than client I/O. One addition: the osd_max_backfills setting is ignored if the cluster is using the mclock scheduler (afaik the default in Ceph Reef). It might be worth changing to the wpq scheduler to get better control over backfilling; the mclock scheduler is too complicated when it comes to this matter.
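To make sure I get the scheduler part right before touching anything, this is roughly what I have noted down (osd.0 is just an example; the mclock override flag is something I picked up from the Reef docs and have not tested myself, so please correct me if switching to wpq is still the better route):

```
# Check which scheduler a running OSD is using (mclock_scheduler is the Reef default)
ceph config show osd.0 osd_op_queue

# Option A: switch to wpq so osd_max_backfills is honoured again.
# This only takes effect after the OSDs are restarted, e.g. one at a time:
ceph config set osd osd_op_queue wpq
ceph orch daemon restart osd.0

# Option B (my assumption from the docs, untested): stay on mclock but allow
# the backfill/recovery settings to be overridden manually:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2
```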
> 2. The first approach might speed up I/O, but it does not solve the problem of OSDs blocking backfilling due to low space. This is controlled by the mon_osd_backfillfull_ratio setting, which defaults to 90%. In your case this means that 1.5 TB is still available, but backfilling is blocked. Raising the threshold might enable more OSDs to start backfilling, which will speed up the overall process. Given the size of your PGs (500 GB according to Laimis), this might be a dangerous operation. I'm not sure whether a running backfill of an individual PG is interrupted if the threshold is reached, or whether it only controls starting _new_ backfills. So if you want to change the threshold, I would propose the following steps:
>
> - set osd_max_backfills to 1
> - change to the wpq scheduler for all OSDs (requires an OSD restart)
> - check the state of the cluster, free space etc.
> - increase mon_osd_backfillfull_ratio in a very small step (e.g. 90% -> 91%)
> - check the cluster state; more PGs should be backfilling now
>
> Before you start this you might want to check how many OSDs are already over the threshold and whether backfilling is moving data to or from them (check "ceph pg dump", it lists which PG is currently backfilling from which set of OSDs to which other set).
>
> I assume that the data pool wants to have 256 PGs (Ceph prefers powers of two), so there will be a lot more data movement. Unless your users are producing a lot of new data, it should be safe to leave the cluster in its current state and allow it to settle. On the other hand, it seems to have been in this state for several weeks now, so a little push might be necessary.
>
> Best regards,
> Burkhard
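Just to be sure I translate the steps above into the right commands before I start, this is the rough sequence I have written down (the set-backfillfull-ratio command is what I found in the docs for changing the ratio at runtime, so treat it and the exact grep as my assumptions rather than part of Burkhard's instructions):

```
# keep backfill throttled while experimenting
ceph config set osd osd_max_backfills 1

# check which OSDs are close to or above the backfillfull threshold
ceph osd df tree

# see which PGs are backfilling or waiting, and between which OSD sets
ceph pg dump pgs_brief | grep -E 'backfill'

# raise the threshold in a very small step (90% -> 91%) and re-check the cluster state
ceph osd set-backfillfull-ratio 0.91
```

I'll do the wpq switch (with the OSD restarts) first, watch "ceph osd df tree" for a while, and only then raise the ratio one percent at a time.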
> On 04.01.25 12:18, bruno.pessanha@xxxxxxxxx wrote:
> > Hi everyone. I'm still learning how to run Ceph properly in production. I have a cluster (Reef 18.2.4) with 10 nodes (8 x 15 TB NVMes each). There are 2 prod pools, one for RGW (3x replica) and one for CephFS (EC 8k2m). It was all fine, but once users started to store more data I started seeing:
> >
> > 1. A very high number of misplaced PGs.
> > 2. OSDs very unbalanced and getting 90% full.
> >
> > ```
> > ceph -s
> >
> >   cluster:
> >     id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
> >     health: HEALTH_WARN
> >             Low space hindering backfill (add storage if this doesn't resolve itself): 195 pgs backfill_toofull
> >             150 pgs not deep-scrubbed in time
> >             150 pgs not scrubbed in time
> >
> >   services:
> >     mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
> >     mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
> >     mds: 5/5 daemons up, 6 standby
> >     osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
> >     rgw: 30 daemons active (10 hosts, 1 zones)
> >
> >   data:
> >     volumes: 1/1 healthy
> >     pools:   11 pools, 1394 pgs
> >     objects: 159.65M objects, 279 TiB
> >     usage:   696 TiB used, 421 TiB / 1.1 PiB avail
> >     pgs:     230137879/647342099 objects misplaced (35.551%)
> >              1033 active+clean
> >              180  active+remapped+backfill_toofull
> >              123  active+remapped+backfill_wait
> >              28   active+clean+scrubbing
> >              15   active+remapped+backfill_wait+backfill_toofull
> >              10   active+clean+scrubbing+deep
> >              5    active+remapped+backfilling
> >
> >   io:
> >     client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
> >     recovery: 479 MiB/s, 283 objects/s
> >
> >   progress:
> >     Global Recovery Event (5w)
> >       [=====================.......] (remaining: 11d)
> > ```
> >
> > I've been trying to rebalance the OSDs manually, since the balancer does not work due to:
> >
> > ```
> > "optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later",
> > ```
> >
> > I manually re-weighted the top 10 most used OSDs, and the number of misplaced objects is going down very slowly. I think it could take many weeks at that rate.
> >
> > There is almost 40% of total free space, but the RGW pool is almost full at ~94%, I think because of the OSD imbalance.
> >
> > ```
> > ceph df
> > --- RAW STORAGE ---
> > CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
> > ssd      1.1 PiB  421 TiB  697 TiB  697 TiB       62.34
> > TOTAL    1.1 PiB  421 TiB  697 TiB  697 TiB       62.34
> >
> > --- POOLS ---
> > POOL                        ID   PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> > .mgr                         1     1   69 MiB        15  207 MiB      0     13 TiB
> > .nfs                         2    32  172 KiB        43  574 KiB      0     13 TiB
> > .rgw.root                    3    32  2.7 KiB         6   88 KiB      0     13 TiB
> > default.rgw.log              4    32  2.1 MiB       209  7.0 MiB      0     13 TiB
> > default.rgw.control          5    32      0 B         8      0 B      0     13 TiB
> > default.rgw.meta             6    32   97 KiB       280  3.5 MiB      0     13 TiB
> > default.rgw.buckets.index    7    32   16 GiB     2.41k   47 GiB   0.11     13 TiB
> > default.rgw.buckets.data    10  1024  197 TiB   133.75M  592 TiB  93.69     13 TiB
> > default.rgw.buckets.non-ec  11    32   78 MiB     1.43M   17 GiB   0.04     13 TiB
> > cephfs.cephfs01.data        12   144   83 TiB    23.99M  103 TiB  72.18     32 TiB
> > cephfs.cephfs01.metadata    13     1  952 MiB   483.14k  3.7 GiB      0     10 TiB
> > ```
> >
> > I also tried changing the following, but it does not seem to persist:
> >
> > ```
> > # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
> > osd_max_backfills = 1
> > osd_recovery_max_active = 0
> > osd_recovery_max_active_hdd = 3
> > osd_recovery_max_active_ssd = 10
> > osd_recovery_op_priority = 3
> > # ceph config set osd osd_max_backfills 10
> > # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
> > osd_max_backfills = 1
> > osd_recovery_max_active = 0
> > osd_recovery_max_active_hdd = 3
> > osd_recovery_max_active_ssd = 10
> > osd_recovery_op_priority = 3
> > ```
> >
> > 1. Why did I end up with so many misplaced PGs, given that there were no changes to the cluster (number of OSDs, hosts, etc.)?
> > 2. Is it OK to change target_max_misplaced_ratio to something higher than .05, so the balancer would work and I wouldn't have to constantly rebalance the OSDs manually?
> > 3. Is there a way to speed up the rebalance?
> > 4. Any other recommendations that could help make my cluster healthy again?
> >
> > Thank you!
> >
> > Bruno

--
Bruno Gomes Pessanha
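P.S. I think I also see now why the settings in my original mail (quoted above) "did not persist": as far as I can tell, ceph-conf --show-config only prints the compiled-in defaults plus whatever is in the local ceph.conf, so it never reflects values stored in the mon config database by "ceph config set". Something like the following (my understanding from the docs, not yet verified on this cluster) should show the effective values instead, and is also how I would raise the balancer threshold from my question 2:

```
# value stored in the centralized config database for the osd section
ceph config get osd osd_max_backfills

# value a specific running daemon is actually using
ceph config show osd.0 osd_max_backfills

# balancer threshold (question 2): raise it cautiously from the default 0.05, e.g. to 0.07
ceph config set mgr target_max_misplaced_ratio 0.07
```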