Re: Questions about PG auto-scaling and node addition

Christophe BAILLON <cb@xxxxxxx> · Fri, 15 Sep 2023 14:00:44 +0200 (CEST)

Thanks for your reply

----- Original Message -----
> From: "Kai Stian Olstad" <ceph+list@xxxxxxxxxx>
> To: "Christophe BAILLON" <cb@xxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Thursday, 14 September 2023 21:44:57
> Subject: Re:  Questions about PG auto-scaling and node addition

> On Wed, Sep 13, 2023 at 04:33:32PM +0200, Christophe BAILLON wrote:
>>We have a cluster with 21 nodes, each having 12 x 18TB, and 2 NVMe for db/wal.
>>We need to add more nodes.
>>The last time we did this, the PGs remained at 1024, so the number of PGs per
>>OSD decreased.
>>Currently, we are at 43 PGs per OSD.
>>
>>Does auto-scaling work correctly in Ceph version 17.2.5?
> 
> I believe so; it is working as designed. By default the auto-scaler increases the
> number of PGs based on how much data is stored.
> So when you add OSDs, data usage stays the same and therefore no scaling is done.
> 
> 
>>Should we increase the number of PGs before adding nodes?
> 
> Adding nodes/OSDs and changing number of PGs involves a lot of data being
> copied around.
> So if those two could be combined you would only need to copy the data once instead
> of twice.
> But whether that is smart or possible, I'm not sure.

Currently we have 2.1 PB of raw data stored (420M objects) on a 4 PB raw cluster.

We will add 10 more nodes in the next few months, with the same configuration.
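
To make the effect of the expansion concrete, here is a rough sketch of the arithmetic using only the figures quoted in this thread (21 nodes, 12 OSDs per node, 43 PGs per OSD, 10 nodes to be added). It assumes pg_num stays unchanged and that PGs end up evenly spread over all OSDs; the function name is just for illustration.

```python
# Figures quoted in this thread.
NODES_NOW = 21
NODES_ADDED = 10
OSDS_PER_NODE = 12       # 12 x 18 TB drives per node
PGS_PER_OSD_NOW = 43     # as reported in the original question


def pgs_per_osd_after_expansion(nodes_now, nodes_added, osds_per_node, pgs_per_osd_now):
    """Average PG copies per OSD after adding nodes, with pg_num left unchanged.

    ASSUMPTION: PGs are spread evenly over all OSDs before and after.
    """
    osds_now = nodes_now * osds_per_node
    osds_after = (nodes_now + nodes_added) * osds_per_node
    # The total number of PG copies (replicas or EC shards) does not change
    # when OSDs are added; it is only spread over more OSDs.
    total_pg_copies = pgs_per_osd_now * osds_now
    return total_pg_copies / osds_after


after = pgs_per_osd_after_expansion(NODES_NOW, NODES_ADDED, OSDS_PER_NODE, PGS_PER_OSD_NOW)
print(f"{after:.1f} PGs per OSD after the expansion")
```

So with 372 OSDs instead of 252 and no change to pg_num, the average would drop from 43 to roughly 29 PGs per OSD.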

We have just one large pool, used exclusively for CephFS, where we only store large files of 2 GB each.
However, we write continuously at 70 to 150 MB/s.
To prevent I/O collapse during backfill and recovery, and to ensure scrubs run smoothly, we have configured a custom osd_mclock_profile.

> 
> 
>>Should we keep PG auto-scaling active?
>>
>>If we disable auto-scaling, should we increase the number of PGs to reach 100
>>PGs per OSD?
> 
> If you know how much data is going to be stored in a pool, the best way
> is to set the number of PGs up front.
> Because every time the auto-scaler changes the number of PGs you will have a
> huge amount of data being copied around to other OSDs.
> 
> You can set the target size or target ratio[1] and the auto-scaler will set the
> appropriate number of PGs on the pool.
> 
> But if you know how much data is going to be stored in a pool you can turn it
> off and just set it manually.

When I created the pool I set:
ceph osd pool set cephfs_data target_size_ratio .9

For a 4 PB raw pool I got 1024 PGs, which is not enough. Where is my error?
I haven't set this:
ceph config set global mon_target_pg_per_osd 100

or this, which is my target for the next 2 years:

ceph osd pool set mypool target_size_bytes 8PT

Or should I disable the PG auto-scaler and increase the PG count manually at the same time as I add the new nodes?
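
For the manual route, a back-of-the-envelope estimate of pg_num for a given PGs-per-OSD target could look like the sketch below. The number of copies per PG (replica count, or k+m for an erasure-coded pool) is not stated in this thread, so it is inferred here from the reported 43 PGs per OSD, 252 OSDs and pg_num 1024, under the assumption that nearly all PGs belong to the single data pool; the real value should be read from the pool itself. The result is rounded to the nearest power of two, and the helper names are hypothetical.

```python
def nearest_power_of_two(n):
    """Round a positive number to the nearest power of two."""
    n = max(1, int(round(n)))
    lower = 1 << (n.bit_length() - 1)   # largest power of two <= n
    upper = lower * 2
    return lower if (n - lower) < (upper - n) else upper


def suggested_pg_num(osds, target_pgs_per_osd, copies_per_pg):
    """Estimate pg_num so that each OSD holds about target_pgs_per_osd PG copies."""
    raw = osds * target_pgs_per_osd / copies_per_pg
    return nearest_power_of_two(raw)


OSDS_NOW = 21 * 12      # current cluster
OSDS_AFTER = 31 * 12    # after adding 10 nodes

# ASSUMPTION: inferred from 43 PGs/OSD with pg_num 1024 (about 10.6 copies per PG).
# Replace with the pool's actual size, or k+m for an erasure-coded pool.
COPIES_PER_PG = 43 * OSDS_NOW / 1024

pg_num_now = suggested_pg_num(OSDS_NOW, 100, COPIES_PER_PG)
pg_num_after = suggested_pg_num(OSDS_AFTER, 100, COPIES_PER_PG)
print(pg_num_now, pg_num_after)
```

Under these assumptions a target of 100 PGs per OSD would point to something like 2048 PGs for the current 252 OSDs and 4096 for 372 OSDs, but this is only a starting point for discussion, not a recommendation.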

I'm a little bit lost :)

Regards

> 
> 100 is a rule of thumb, but with such large disks you could, or maybe should,
> consider having a higher number of PGs per OSD.
> 
> 
> [1]
> https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#viewing-pg-scaling-recommendations
> 
> --
> Kai Stian Olstad

-- 
Christophe BAILLON
Mobile :: +336 16 400 522
Work :: https://eyona.com
Twitter :: https://twitter.com/ctof