> On Dec 23, 2024, at 12:08 PM, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>
> For default.rgw.buckets.index ssd-s (actually nvme-s), and for
> default.rgw.buckets.data hdd-s. Average PG is around ~17.7.

Yikes, that is not doing you any favors at all, in terms of both performance and uniform OSD utilization.

The party line is currently a target ratio of 100 PGs per OSD, though I have a PR open to return it to the former target of 200. I’d really like to make that 500, but we need to be somewhat conservative.

Once you get your CRUSH rules / device classes sorted out, the autoscaler should grow your pg_nums substantially, or you can take a walk on the wild side by turning it off and calculating yourself, old-school:

https://docs.ceph.com/en/squid/rados/operations/pgcalc/
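If you do go the manual route, it's just a couple of per-pool settings. A rough
sketch (same idea for the index and cephfs_metadata pools) -- sanity-check the
pg_num value against the pgcalc output and your OSD count before running
anything:

    # keep the autoscaler from fighting your manual values
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off

    # bump pg_num; Nautilus and later ramp pgp_num up behind the scenes
    ceph osd pool set default.rgw.buckets.data pg_num 2048

    # watch per-OSD PG counts and the autoscaler's view of things
    ceph osd df
    ceph osd pool autoscale-status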
> Actually most of the disks are Seagate 6T
> <https://www.amazon.com/Seagate-Enterprise-Capacity-ST6000NM0095-7200RPM/dp/B01CG0DBXE>
> in size, but this "translates" to 5.5T in ceph and yes, they are pretty old
> (from 2016, 2017 up to 2021).

Ack, I suspected so. That “translation” is in large part due to storage manufacturers being weasels: they describe devices in base-10 units (TB), while humans and most everything else think mainly in base-2 units (TiB). 6.0 TB = 5.45697 TiB. Back 10-12 years ago Apple switched the macOS Finder from using the former to the latter, and people were outraged because they believed that Apple was taking storage away.
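The math, for anyone following along:

    6 TB  = 6 x 10^12 bytes
    1 TiB = 2^40 bytes ~= 1.0995 x 10^12 bytes
    6 x 10^12 bytes / 2^40 bytes per TiB ~= 5.457 TiB

which is where the ~5.5T you see in Ceph comes from.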
>
> Rok
>
> On Mon, Dec 23, 2024 at 4:41 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
>>
>> [root@ctplmon1 ~]# ceph osd dump | grep pool
>> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 320144 flags
>> hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
>> pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor
>> 0/18964/18962 flags hashpspool stripe_width 0 application rgw
>> pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
>> 320144 lfor 0/127672/127670 flags hashpspool stripe_width 0 application rgw
>> pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
>> 320144 lfor 0/59850/59848 flags hashpspool stripe_width 0 application rgw
>> pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
>> 320144 lfor 0/51538/51536 flags hashpspool stripe_width 0 pg_autoscale_bias
>> 4 pg_num_min 8 application rgw
>> pool 6 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
>> 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
>> 315285 lfor 0/127830/127828 flags hashpspool stripe_width 0
>> pg_autoscale_bias 4 pg_num_min 8 application rgw
>> pool 7 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
>> last_change 320144 lfor 0/76474/76472 flags hashpspool stripe_width 0
>> application rgw
>> pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5
>> min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512
>> autoscale_mode on last_change 320144 lfor 0/127784/214408 flags
>> hashpspool,ec_overwrites stripe_width 12288 application rgw
>> pool 10 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change
>> 320144 flags hashpspool,bulk stripe_width 0 application cephfs
>> pool 11 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 4
>> object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
>> 320144 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
>> recovery_priority 5 application cephfs
>>
>> ---
>>
>> Are you using HDDs, SSDs, or both? What does the PGs column at the right
>> end of `ceph osd df` average? I’m still spinning up my brain this morning,
>> but this seems reeeeeally low, like ~17 if all the OSDs are the same device
>> class.
>>
>> buckets.index, notably, should be way higher. Assuming that your OSDs are
>> all identical and thus that the index pool spans them all, I’d increase
>> pg_num for the index pool and cephfs_metadata to 256 and for buckets.data
>> to maybe 2048.
>>
>> Right now there are around 200 osds (5.5T) in a cluster, with around 25
>> waiting to be added.
>>
>> 5.5T seems like an unusual number. Are these old HDDs, or perhaps 3DWPD
>> SSDs?
>>
>> Rok
>>
>> On Mon, Dec 23, 2024 at 4:16 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
>> wrote:
>>
>>>
>>>> autoscale_mode for pg is on for a particular pool
>>>> (default.rgw.buckets.data) and EC 3-2 is used. During pool lifetime I've
>>>> seen one time that the PG number has changed automatically
>>>
>>> pg_num for a given pool likes to be a power of 2, so either the relative
>>> usage of pools or the overall cluster fillage has to change substantially
>>> for a change to be triggered in many cases.
>>>
>>>> but now I am also considering changing PG number manually after
>>>> backfills complete.
>>>
>>> If you do, be sure to disable the autoscaler for that pool.
>>>
>>>> Right now pg_num 512 pgp_num 512 is used and I am considering changing
>>>> it to 1024. Do you think that would be too aggressive maybe?
>>>
>>> Depends on how many OSDs you have and what the rest of the pools are
>>> like. Send us
>>>
>>> `ceph osd dump | grep pool`
>>>
>>> These days, assuming that your OSDs are BlueStore, chances are that going
>>> higher on pg_num won’t cause issues.
>>>
>>>> Rok
>>>>
>>>> On Sun, Dec 22, 2024 at 8:46 PM Alwin Antreich <alwin.antreich@xxxxxxxx>
>>>> wrote:
>>>>
>>>>> Hi Rok,
>>>>>
>>>>> On Sun, 22 Dec 2024 at 20:19, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>>>>>
>>>>>> First I tried with osd reweight, waited a few hours then osd crush
>>>>>> reweight, then with pg-upmap from Laimis. Seems the crush reweight was
>>>>>> most effective, but not for "all" osds I tried.
>>>>>>
>>>>>> Uh, probably I've set ceph config set osd osd_max_backfills to a high
>>>>>> number in the past, probably better to reduce it to 1 in steps, since
>>>>>> now much backfilling is already going on?
>>>>>>
>>>>> Every time a backfill finishes, a new one will be placed in the queue.
>>>>> The number of backfills won't reduce as long as you don't lower it. You
>>>>> can adjust it and see if it improves the backfill process or not (wait
>>>>> an hour or two).
>>>>>
>>>>>> Output of commands in attachment.
>>>>>>
>>>>> There seems to be a low amount of PGs for the rgw data pool, compared
>>>>> to the amount of OSDs. Though it depends on the EC profile and the size
>>>>> of a shard (`ceph pg <id> query`) whether this is really an issue. But
>>>>> in general the number of PGs is important, because too few of them will
>>>>> make them grow larger. Hence backfilling a PG will take longer and more
>>>>> easily tilts the usage of OSDs, as the algorithm places PGs
>>>>> pseudo-randomly and does not take their size into account.
>>>>>
>>>>> I'd wait with the PG adjustment until the backfilling to the HDDs has
>>>>> finished, should you need to adjust the number of PGs, as this will
>>>>> create more data movement.
>>>>>
>>>>> Cheers,
>>>>> Alwin
>>>>> croit GmbH, https://croit.io/
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx