For default.rgw.buckets.index we use SSDs (actually NVMe drives), and for
default.rgw.buckets.data HDDs. The average PG count per OSD (the PGS column
in `ceph osd df`) is around 17.7.

Most of the disks are 6 TB Seagate Enterprise Capacity drives
<https://www.amazon.com/Seagate-Enterprise-Capacity-ST6000NM0095-7200RPM/dp/B01CG0DBXE>,
which "translates" to 5.5T in Ceph, and yes, they are pretty old (from
2016-2017 up to 2021).

Rok

On Mon, Dec 23, 2024 at 4:41 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> [root@ctplmon1 ~]# ceph osd dump | grep pool
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 320144 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
> pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/18964/18962 flags hashpspool stripe_width 0 application rgw
> pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/127672/127670 flags hashpspool stripe_width 0 application rgw
> pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/59850/59848 flags hashpspool stripe_width 0 application rgw
> pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 320144 lfor 0/51538/51536 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
> pool 6 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 315285 lfor 0/127830/127828 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
> pool 7 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/76474/76472 flags hashpspool stripe_width 0 application rgw
> pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5 min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 320144 lfor 0/127784/214408 flags hashpspool,ec_overwrites stripe_width 12288 application rgw
> pool 10 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 320144 flags hashpspool,bulk stripe_width 0 application cephfs
> pool 11 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 320144 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
>
> ---
>
> Are you using HDDs, SSDs, or both? What does the PGs column at the right
> end of `ceph osd df` average? I'm still spinning up my brain this morning,
> but this seems reeeeeally low, like ~17 if all the OSDs are the same
> device class.
>
> buckets.index, notably, should be way higher. Assuming that your OSDs are
> all identical and thus that the index pool spans them all, I'd increase
> pg_num for the index pool and cephfs_metadata to 256 and for buckets.data
> to maybe 2048.
>
> Right now there are around 200 osds (5.5T) in a cluster, with around 25
> waiting to be added.
>
> 5.5T seems like an unusual number. Are these old HDDs, or perhaps 3DWPD
> SSDs?
>
> Rok
>
> On Mon, Dec 23, 2024 at 4:16 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
>> > autoscale_mode for pg is on for that particular pool
>> > (default.rgw.buckets.data) and EC 3-2 is used. During the pool's lifetime
>> > I've seen the PG number change automatically once,
>>
>> pg_num for a given pool likes to be a power of 2, so either the relative
>> usage of pools or the overall cluster fillage has to change substantially
>> for a change to be triggered in many cases.
>>
>> > but now I am also considering changing the PG number manually after the
>> > backfill completes.
>>
>> If you do, be sure to disable the autoscaler for that pool.
>>
>> > Right now pg_num 512 pgp_num 512 is used and I am considering changing
>> > it to 1024. Do you think that would be too aggressive maybe?
>>
>> Depends on how many OSDs you have and what the rest of the pools are
>> like. Send us
>>
>> `ceph osd dump | grep pool`
>>
>> These days, assuming that your OSDs are BlueStore, chances are that going
>> higher on pg_num won't cause issues.
>>
>> > Rok
>> >
>> > On Sun, Dec 22, 2024 at 8:46 PM Alwin Antreich <alwin.antreich@xxxxxxxx>
>> > wrote:
>> >
>> >> Hi Rok,
>> >>
>> >> On Sun, 22 Dec 2024 at 20:19, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>> >>
>> >>> First I tried with osd reweight, waited a few hours, then osd crush
>> >>> reweight, then with pg-upmap from Laimis. Crush reweight seems to have
>> >>> been the most effective, but not for "all" OSDs I tried.
>> >>>
>> >>> Uh, I probably set ceph config set osd osd_max_backfills to a high
>> >>> number in the past; is it better to reduce it to 1 in steps, since a
>> >>> lot of backfilling is already going on?
>> >>>
>> >> Every time a backfill finishes, a new one will be placed in the queue.
>> >> The number of backfills won't reduce as long as you don't lower it. You
>> >> can adjust it and see if it improves the backfill process or not (wait
>> >> an hour or two).
>> >>
>> >>> Output of commands in attachment.
>> >>>
>> >> There seems to be a low number of PGs for the RGW data pool compared to
>> >> the number of OSDs. Though it depends on the EC profile and the size of
>> >> a shard (`ceph pg <id> query`) whether this is really an issue. But in
>> >> general the number of PGs is important, because too few of them will
>> >> make them grow larger. Hence backfilling a PG will take longer and more
>> >> easily tilts the usage of OSDs, as the algorithm places PGs
>> >> pseudo-randomly and does not take their size into account.
>> >>
>> >> I'd wait with the PG adjustment until the backfilling to the HDDs has
>> >> finished, should you need to adjust the number of PGs, as this will
>> >> create more data movement.
>> >>
>> >> Cheers,
>> >> Alwin
>> >> croit GmbH, https://croit.io/
>> >>
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
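
[Editor's note: a minimal sketch of the adjustments discussed in the thread,
assuming the pool names from the `ceph osd dump` output above. The pg_num
targets are Anthony's suggestions (he floated up to 2048 for buckets.data;
Rok was considering 1024) and the backfill throttle follows Alwin's advice,
so validate both against the actual OSD count and device classes before
applying anything.]

    # Step the raised osd_max_backfills back down, as discussed.
    ceph config set osd osd_max_backfills 1

    # Disable the autoscaler on pools that will be resized manually,
    # so it does not fight the manual pg_num change.
    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
    ceph osd pool set cephfs_metadata pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off

    # Raise pg_num to power-of-two targets once the current backfill has
    # finished; on Nautilus and later, pgp_num follows pg_num automatically.
    ceph osd pool set default.rgw.buckets.index pg_num 256
    ceph osd pool set cephfs_metadata pg_num 256
    ceph osd pool set default.rgw.buckets.data pg_num 1024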