Re: Many misplaced PG's, full OSD's and a good amount of manual intervention to keep my Ceph cluster alive.

Hi everyone. Yes, all the tips definitely helped! I now have more free space in the pools, the number of misplaced PGs has decreased a lot, and the standard deviation of OSD usage is lower. The storage looks way healthier now. Thanks a bunch!

I'm only confused by the number of misplaced PGs, which never stays below 5%. Every time it hits 5% it goes up and down, as shown in this quite interesting graph:
[Attachment: image.png]

Any idea why that might be?

I had the impression that it might be related to the balancer kicking in and misplacing PGs again. Or am I missing something?

Bruno

On Mon, 6 Jan 2025 at 16:00, Bruno Gomes Pessanha <bruno.pessanha@xxxxxxxxx> wrote:
So you might set the full ratio to .98, backfillfull to .96.  Nearfull is only cosmetic.
Thanks for the advice. It seems to be working with 0.92 for now. If it gets stuck I'll increase it.

On Mon, 6 Jan 2025 at 00:24, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:


Very solid advice here - that’s the beauty of the Ceph community.

Just adding to what Anthony mentioned: a reweight from 1 to 0.2 (and back) is quite extreme and the cluster won’t like it.

And these days with the balancer, pg-upmap entries to the same effect are a better idea.

From the clients' perspective, your main concern now is to keep the pools “alive” with enough space while the backfilling takes place.

To that end, you can *temporarily* give yourself a bit more margin:

ceph osd set-nearfull-ratio .85
ceph osd set-backfillfull-ratio .90
ceph osd set-full-ratio .95
Those are the default values, and Ceph (now) enforces that the values are >= (or maybe >) in that order.

So you might set the full ratio to .98, backfillfull to .96.  Nearfull is only cosmetic.

But absolutely do not forget to revert to default values once the cluster is balanced, or to other values that you make an educated decision to choose.
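
As a rough sketch of the ordering constraint described above, the Python helper below checks that the three ratios are increasing before it builds the commands quoted in this thread. The function name is hypothetical and not part of Ceph. The strict comparison is an assumption on the conservative side, since it is not certain here whether Ceph accepts equal values.

```python
def full_ratio_commands(nearfull: float, backfillfull: float, full: float) -> list[str]:
    """Build the three `ceph osd set-*-ratio` commands after an ordering check.

    Hypothetical helper: it only returns command strings and never runs them.
    Assumption: ratios must satisfy nearfull < backfillfull < full <= 1.0.
    """
    if not (0.0 < nearfull < backfillfull < full <= 1.0):
        raise ValueError(
            f"ratios out of order: nearfull={nearfull}, "
            f"backfillfull={backfillfull}, full={full}"
        )
    return [
        f"ceph osd set-nearfull-ratio {nearfull}",
        f"ceph osd set-backfillfull-ratio {backfillfull}",
        f"ceph osd set-full-ratio {full}",
    ]


if __name__ == "__main__":
    # Temporary margin suggested in the thread: full .98, backfillfull .96.
    for cmd in full_ratio_commands(0.85, 0.96, 0.98):
        print(cmd)
```

Calling the same helper with the defaults (0.85, 0.90, 0.95) gives the commands to revert once the cluster is balanced.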

Even with plenty of OSDs that are not filled you might hit a single overfilled OSD and the whole pool will stop accepting new data. 

Yep, see above.  Not immediately clear to me why that data pool is so full unless the CRUSH rule / device classes are wonky.

Clients will start getting “No more space available” errors. That happened to us with CephFS recently with a very similar scenario where the cluster got much more data than expected in a short amount of time, not fun. 
With the balancer not working due to too many misplaced objects, that’s an increased risk, so just a heads-up to keep it in mind. To get things working, we simply balanced the OSDs manually with upmaps, moving data from the fullest ones to the least full ones (our built-in balancer sadly does not work).
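
A minimal sketch of that manual approach, assuming per-OSD utilization percentages have already been collected (for example from `ceph osd df`). The helper names, the sample numbers, and the PG id are hypothetical. The sketch only pairs the fullest OSDs with the emptiest ones and formats a `ceph osd pg-upmap-items` command. It does not check that the PG actually lives on the source OSD or that the target satisfies the CRUSH rule, both of which matter on a real cluster.

```python
def pair_fullest_with_emptiest(
    utilization: dict[int, float], max_pairs: int = 3, min_gap: float = 5.0
) -> list[tuple[int, int]]:
    """Return (source_osd, target_osd) pairs, fullest matched with emptiest.

    Stops once the utilization gap (in percentage points) drops below min_gap.
    """
    ranked = sorted(utilization, key=utilization.get)  # least full first
    pairs = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi and len(pairs) < max_pairs:
        src, dst = ranked[hi], ranked[lo]
        if utilization[src] - utilization[dst] < min_gap:
            break
        pairs.append((src, dst))
        lo += 1
        hi -= 1
    return pairs


def upmap_command(pgid: str, src_osd: int, dst_osd: int) -> str:
    """Format the command that remaps one PG from src_osd to dst_osd."""
    return f"ceph osd pg-upmap-items {pgid} {src_osd} {dst_osd}"


if __name__ == "__main__":
    sample = {0: 93.0, 1: 61.0, 2: 88.0, 3: 70.0}  # hypothetical %USE per OSD
    for src, dst in pair_fullest_with_emptiest(sample):
        print(upmap_command("10.1f", src, dst))  # "10.1f" is a made-up PG id
```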


One small observation:
I’ve noticed that the output of 'ceph osd pool ls detail | grep cephfs.cephfs01.data' shows pg_num increased while pgp_num is still the same.
You will need to set it as well for data migration to new pgs to happen: https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups

The mgr usually does that for recent Ceph releases.  With older releases we had to increment pg_num and pgp_num in lockstep, which was kind of a pain.
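
For older releases where this has to be done by hand, the check can be sketched as follows. The sample line is hypothetical (its numbers are made up, and the exact field layout of `ceph osd pool ls detail` varies between releases). The helper only assumes the line contains the pool name in single quotes plus `pg_num N` and `pgp_num N` fields.

```python
import re
from typing import Optional


def pgp_num_fix(pool_detail_line: str) -> Optional[str]:
    """Return a `ceph osd pool set ... pgp_num` command if pgp_num lags pg_num.

    Returns None when the values match or the line cannot be parsed.
    """
    name = re.search(r"pool \d+ '([^']+)'", pool_detail_line)
    pg = re.search(r"\bpg_num (\d+)", pool_detail_line)
    pgp = re.search(r"\bpgp_num (\d+)", pool_detail_line)
    if not (name and pg and pgp):
        return None
    pg_num, pgp_num = int(pg.group(1)), int(pgp.group(1))
    if pgp_num == pg_num:
        return None
    return f"ceph osd pool set {name.group(1)} pgp_num {pg_num}"


if __name__ == "__main__":
    # Hypothetical output line; only the pool name comes from this thread.
    line = (
        "pool 5 'cephfs.cephfs01.data' replicated size 3 min_size 2 "
        "crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 256 "
        "autoscale_mode on"
    )
    print(pgp_num_fix(line))
```

On recent releases the mgr adjusts pgp_num gradually on its own, so a mismatch there may simply mean the change is still in progress.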



Best,
Laimis J.

On 5 Jan 2025, at 16:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:


What reweights have been set for the top OSDs (ceph osd df tree)?

Right now they are all at 1.0. I had to lower them to something close to
0.2 in order to free up space but I changed them back to 1.0. Should I
lower them while the backfill is happening?

Old-style legacy override reweights don’t mesh well with the balancer.   Best to leave them at 1.00.  

0.2 is pretty extreme; back in the day I rarely went below 0.8.

```
"optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced;
try again late
```

That should clear.  The balancer doesn’t want to stir up trouble if the cluster already has a bunch of backfill / recovery going on.  Patience!
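
To make the numbers in that message concrete, here is a small sketch that pulls the two ratios out of the balancer's `optimize_result` text. The function name is hypothetical. The second number appears to be the .05 value of target_max_misplaced_ratio discussed elsewhere in this thread, which is an assumption based on the matching value.

```python
import re
from typing import Optional, Tuple


def parse_misplaced(optimize_result: str) -> Optional[Tuple[float, float]]:
    """Extract (current_misplaced_ratio, threshold) from the balancer message.

    Returns None if the message does not have the expected shape.
    """
    match = re.search(r"\(([\d.]+) > ([\d.]+)\) are misplaced", optimize_result)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    message = "Too many objects (0.355160 > 0.050000) are misplaced"
    current, threshold = parse_misplaced(message)
    # The balancer holds off while the current ratio exceeds the threshold.
    print(f"{current:.1%} misplaced, balancer waits until below {threshold:.1%}")
```

In the quoted case roughly 35.5% of objects were misplaced against a 5% threshold, so the balancer declined to plan further moves until backfill caught up.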

default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69  13 TiB
default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB

That’s odd that the data pool is that full but the others aren’t.  

Please send `ceph osd crush rule dump` and `ceph osd dump | grep pool`.



I also tried changing the following but it does not seem to persist:

Could be an mclock thing.  

1. Why did I end up with so many misplaced PGs when there were no changes
to the cluster (number of OSDs, hosts, etc.)?

Probably a result of the autoscaler splitting PGs or of some change to CRUSH rules such that some data can’t be placed.

2. Is it OK to change target_max_misplaced_ratio to something higher
than .05 so the balancer would work and I wouldn't have to constantly
rebalance the OSDs manually?

I wouldn’t; that’s a symptom, not the disease.
Bruno
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





--
Bruno Gomes Pessanha