Re: Many misplaced PG's, full OSD's and a good amount of manual intervention to keep my Ceph cluster alive.

Sorry for the mail spam, but one last question:
What reweights have been set for the top OSDs (ceph osd df tree)?
Just a guess, but they might have been a bit too aggressive and caused a lot of backfilling operations.
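
To see which OSDs carry an override reweight, the JSON form of the same command can be filtered. This is only a minimal sketch. It assumes `ceph osd df tree -f json` returns a top-level `nodes` list whose entries have `name`, `type`, `reweight` and `utilization` fields (field names from memory, worth checking against your own output). The sample data is made up for illustration.

```python
import json


def reweighted_osds(df_tree: dict) -> list[tuple[str, float, float]]:
    """Return (name, reweight, utilization %) for OSDs with reweight < 1.0,
    lowest reweight first. Non-OSD nodes (root, host buckets) are skipped."""
    rows = [
        (n["name"], n["reweight"], n["utilization"])
        for n in df_tree.get("nodes", [])
        if n.get("type") == "osd" and n.get("reweight", 1.0) < 1.0
    ]
    return sorted(rows, key=lambda r: r[1])


# Hypothetical sample, NOT taken from the cluster discussed in this thread.
sample = json.loads("""
{"nodes": [
  {"id": -1, "name": "default", "type": "root", "reweight": -1,   "utilization": 62.3},
  {"id": 0,  "name": "osd.0",   "type": "osd",  "reweight": 0.85, "utilization": 88.1},
  {"id": 1,  "name": "osd.1",   "type": "osd",  "reweight": 1.0,  "utilization": 55.0},
  {"id": 2,  "name": "osd.2",   "type": "osd",  "reweight": 0.7,  "utilization": 91.4}
]}
""")

for name, reweight, util in reweighted_osds(sample):
    print(f"{name}: reweight={reweight} utilization={util}%")
```

On a live cluster the dict would come from parsing the output of `ceph osd df tree -f json` instead of the sample. Reweights far below 1.0 on many OSDs at once would fit the guess above.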


Best,
Laimis J.

> On 4 Jan 2025, at 18:05, Laimis Juzeliūnas <laimis.juzeliunas@xxxxxxxxxx> wrote:
> 
> Hello Bruno,
> 
> Interesting case, a few observations.
> 
> What’s the average size of your PGs?
> Judging from the ceph status you have 1394 PGs in total and 696 TiB of used storage, that’s roughly 500 GiB per PG if I’m not mistaken.
> With the backfill limits in place, each PG takes a long time to move because of its size. You could try increasing the PG count in the pools to get lighter placement groups.
> 
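
The per-PG estimate can be checked with a few lines of arithmetic. This is only a sketch using the raw usage and PG counts from the `ceph -s` and `ceph df` output quoted further down. The helper name `avg_pg_size_gib` is made up for illustration, and the USED figures include replica and EC overhead, so they describe a PG across all of its copies or shards.

```python
GIB_PER_TIB = 1024


def avg_pg_size_gib(used_tib: float, pg_count: int) -> float:
    """Average raw data per placement group, in GiB."""
    return used_tib * GIB_PER_TIB / pg_count


# (raw USED in TiB, number of PGs), copied from the quoted command output.
figures = {
    "whole cluster": (696, 1394),
    "default.rgw.buckets.data": (592, 1024),
    "cephfs.cephfs01.data": (103, 144),
}

for name, (used_tib, pgs) in figures.items():
    print(f"{name}: {avg_pg_size_gib(used_tib, pgs):.0f} GiB per PG")
```

This gives about 511 GiB per PG cluster-wide, in line with the rough estimate above, and suggests the CephFS data pool has the heaviest PGs of all (about 732 GiB each).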
> Are you using mclock? If yes, you can try setting the profile to prioritise recovery operations with 'ceph config set osd osd_mclock_profile high_recovery_ops'
> 
> The max backfills configuration is an interesting one - it should persist. 
> What happens if you set it through the Ceph UI?
> 
> In general it looks like the balancer might be “fighting” with the manual OSD balancing.
> You could try turning it off and doing the balancing yourself (this might be helpful: https://github.com/laimis9133/plankton-swarm).
> 
> Also, you probably know this already, but keep in mind that erasure-coded pools tend to be on the slower side for any data movement because of the additional operations needed.
> 
> 
> Best,
> Laimis J.
> 
> 
>> On 4 Jan 2025, at 13:18, bruno.pessanha@xxxxxxxxx wrote:
>> 
>> Hi everyone. I'm still learning how to run Ceph properly in production. I have a cluster (Reef 18.2.4) with 10 nodes (8 x 15TB NVMe drives each). There are 2 prod pools, one for RGW (3 x replica) and one for CephFS (EC 8k2m). It was all fine, but once users started storing more data I started seeing:
>> 1. A very high number of misplaced PGs.
>> 2. Very unbalanced OSDs, some getting 90% full.
>> ```
>> ceph -s                                                             
>> 
>>  cluster:
>>    id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
>>    health: HEALTH_WARN
>>            Low space hindering backfill (add storage if this doesn't resolve itself): 195 pgs backfill_toofull
>>            150 pgs not deep-scrubbed in time
>>            150 pgs not scrubbed in time
>> 
>>  services:
>>    mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
>>    mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
>>    mds: 5/5 daemons up, 6 standby
>>    osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
>>    rgw: 30 daemons active (10 hosts, 1 zones)
>> 
>>  data:
>>    volumes: 1/1 healthy
>>    pools:   11 pools, 1394 pgs
>>    objects: 159.65M objects, 279 TiB
>>    usage:   696 TiB used, 421 TiB / 1.1 PiB avail
>>    pgs:     230137879/647342099 objects misplaced (35.551%)
>>             1033 active+clean
>>             180  active+remapped+backfill_toofull
>>             123  active+remapped+backfill_wait
>>             28   active+clean+scrubbing
>>             15   active+remapped+backfill_wait+backfill_toofull
>>             10   active+clean+scrubbing+deep
>>             5    active+remapped+backfilling
>> 
>>  io:
>>    client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
>>    recovery: 479 MiB/s, 283 objects/s
>> 
>>  progress:
>>    Global Recovery Event (5w)
>>      [=====================.......] (remaining: 11d)
>> ```
>> 
>> I've been trying to rebalance the OSDs manually since the balancer does not work due to:
>> ```
>> "optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later",
>> ```
>> I manually re-weighted the top 10 most used OSDs and the number of misplaced objects is going down very slowly. I think it could take many weeks at that rate.
>> Almost 40% of the total space is free, but the RGW pool is almost full at ~94%, I think because of the OSD imbalance.
>> ```
>> ceph df
>> --- RAW STORAGE ---
>> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
>> ssd    1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
>> TOTAL  1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
>> 
>> --- POOLS ---
>> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>> .mgr                         1     1   69 MiB       15  207 MiB      0     13 TiB
>> .nfs                         2    32  172 KiB       43  574 KiB      0     13 TiB
>> .rgw.root                    3    32  2.7 KiB        6   88 KiB      0     13 TiB
>> default.rgw.log              4    32  2.1 MiB      209  7.0 MiB      0     13 TiB
>> default.rgw.control          5    32      0 B        8      0 B      0     13 TiB
>> default.rgw.meta             6    32   97 KiB      280  3.5 MiB      0     13 TiB
>> default.rgw.buckets.index    7    32   16 GiB    2.41k   47 GiB   0.11     13 TiB
>> default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69     13 TiB
>> default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB   0.04     13 TiB
>> cephfs.cephfs01.data         12   144   83 TiB   23.99M  103 TiB  72.18     32 TiB
>> cephfs.cephfs01.metadata     13     1  952 MiB  483.14k  3.7 GiB      0     10 TiB
>> ```
>> 
>> I also tried changing the following but it does not seem to persist:
>> ```
>> # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
>> osd_max_backfills = 1
>> osd_recovery_max_active = 0
>> osd_recovery_max_active_hdd = 3
>> osd_recovery_max_active_ssd = 10
>> osd_recovery_op_priority = 3
>> # ceph config set osd osd_max_backfills 10
>> # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
>> osd_max_backfills = 1
>> osd_recovery_max_active = 0
>> osd_recovery_max_active_hdd = 3
>> osd_recovery_max_active_ssd = 10
>> osd_recovery_op_priority = 3
>> ```
>> 
>> 1. Why did I end up with so many misplaced PGs when there were no changes to the cluster (number of OSDs, hosts, etc.)?
>> 2. Is it OK to change target_max_misplaced_ratio to something higher than .05 so the balancer would work and I wouldn't have to constantly rebalance the OSDs manually?
>> 3. Is there a way to speed up the rebalance?
>> 4. Any other recommendations that could help make my cluster healthy again?
>> 
>> Thank you!
>> 
>> Bruno
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
