> On Dec 23, 2024, at 12:08 PM, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>
> For default.rgw.buckets.index ssd-s (actually nvme-s), and for
> default.rgw.buckets.data hdd-s. Average PG is around ~17.7.

Yikes, that is not doing you any favors at all, in terms of both performance and uniform OSD utilization.

The party line is currently a target ratio of 100 PGs per OSD, though I have a PR open to return it to the former target of 200. I’d really like to make that 500, but we need to be somewhat conservative.

Once you get your CRUSH rules / device classes sorted out, the autoscaler should grow your pg_nums substantially, or you can take a walk on the wild side by turning it off and calculating yourself, old-school:

https://docs.ceph.com/en/squid/rados/operations/pgcalc/
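If you do go the manual route, it's just a couple of per-pool settings. A rough
sketch (same idea for the index and cephfs_metadata pools) -- sanity-check the
pg_num value against the pgcalc output and your OSD count before running
anything:

    # keep the autoscaler from fighting your manual values
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off

    # bump pg_num; Nautilus and later ramp pgp_num up behind the scenes
    ceph osd pool set default.rgw.buckets.data pg_num 2048

    # watch per-OSD PG counts and the autoscaler's view of things
    ceph osd df
    ceph osd pool autoscale-status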
> Actually most of the disks are Seagate 6T
> <https://www.amazon.com/Seagate-Enterprise-Capacity-ST6000NM0095-7200RPM/dp/B01CG0DBXE>
> in size, but this "translates" to 5.5T in ceph and yes, they are pretty old
> (from 2016, 2017 up to 2021).

Ack, I suspected so. That “translation” is in large part due to storage manufacturers being weasels: they describe devices in base-10 units (TB), while humans and most everything else think mainly in base-2 units (TiB). 6.0 TB = 5.45697 TiB. Back 10-12 years ago Apple switched the macOS Finder from using the former to the latter, and people were outraged because they believed that Apple was taking storage away.
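The math, for anyone following along:

    6 TB  = 6 x 10^12 bytes
    1 TiB = 2^40 bytes ~= 1.0995 x 10^12 bytes
    6 x 10^12 bytes / 2^40 bytes per TiB ~= 5.457 TiB

which is where the ~5.5T you see in Ceph comes from.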
>
> Rok
>
> On Mon, Dec 23, 2024 at 4:41 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
>>
>> [root@ctplmon1 ~]# ceph osd dump | grep pool
>> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 320144 flags
>> hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
>> pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor
>> 0/18964/18962 flags hashpspool stripe_width 0 application rgw
>> pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
>> 320144 lfor 0/127672/127670 flags hashpspool stripe_width 0 application rgw
>> pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
>> 320144 lfor 0/59850/59848 flags hashpspool stripe_width 0 application rgw
>> pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
>> 320144 lfor 0/51538/51536 flags hashpspool stripe_width 0 pg_autoscale_bias
>> 4 pg_num_min 8 application rgw
>> pool 6 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
>> 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
>> 315285 lfor 0/127830/127828 flags hashpspool stripe_width 0
>> pg_autoscale_bias 4 pg_num_min 8 application rgw
>> pool 7 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
>> last_change 320144 lfor 0/76474/76472 flags hashpspool stripe_width 0
>> application rgw
>> pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5
>> min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512
>> autoscale_mode on last_change 320144 lfor 0/127784/214408 flags
>> hashpspool,ec_overwrites stripe_width 12288 application rgw
>> pool 10 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change
>> 320144 flags hashpspool,bulk stripe_width 0 application cephfs
>> pool 11 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 4
>> object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
>> 320144 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
>> recovery_priority 5 application cephfs
>>
>> ---
>>
>> Are you using HDDs, SSDs, or both? What does the PGs column at the right
>> end of `ceph osd df` average? I’m still spinning up my brain this morning,
>> but this seems reeeeeally low, like ~17 if all the OSDs are the same device
>> class.
>>
>> buckets.index, notably, should be way higher. Assuming that your OSDs are
>> all identical and thus that the index pool spans them all, I’d increase
>> pg_num for the index pool and cephfs_metadata to 256 and for buckets.data
>> to maybe 2048.
>>
>> Right now there are around 200 osds (5.5T) in a cluster, with around 25
>> waiting to be added.
>>
>> 5.5T seems like an unusual number. Are these old HDDs, or perhaps 3DWPD
>> SSDs?
>>
>> Rok
>>
>> On Mon, Dec 23, 2024 at 4:16 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
>> wrote:
>>
>>>
>>>> autoscale_mode for pg is on for a particular pool
>>>> (default.rgw.buckets.data) and EC 3-2 is used. During pool lifetime I've
>>>> seen one time that the PG number has changed automatically
>>>
>>> pg_num for a given pool likes to be a power of 2, so either the relative
>>> usage of pools or the overall cluster fillage has to change substantially
>>> for a change to be triggered in many cases.
>>>
>>>> but now I am also considering changing PG number manually after
>>>> backfills complete.
>>>
>>> If you do, be sure to disable the autoscaler for that pool.
>>>
>>>> Right now pg_num 512 pgp_num 512 is used and I am considering changing
>>>> it to 1024. Do you think that would be too aggressive maybe?
>>>
>>> Depends on how many OSDs you have and what the rest of the pools are
>>> like. Send us
>>>
>>> `ceph osd dump | grep pool`
>>>
>>> These days, assuming that your OSDs are BlueStore, chances are that going
>>> higher on pg_num won’t cause issues.
>>>
>>>> Rok
>>>>
>>>> On Sun, Dec 22, 2024 at 8:46 PM Alwin Antreich <alwin.antreich@xxxxxxxx>
>>>> wrote:
>>>>
>>>>> Hi Rok,
>>>>>
>>>>> On Sun, 22 Dec 2024 at 20:19, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>>>>>
>>>>>> First I tried with osd reweight, waited a few hours then osd crush
>>>>>> reweight, then with pg-upmap from Laimis. Seems the crush reweight was
>>>>>> most effective, but not for "all" osds I tried.
>>>>>>
>>>>>> Uh, probably I've set ceph config set osd osd_max_backfills to a high
>>>>>> number in the past, probably better to reduce it to 1 in steps, since
>>>>>> now much backfilling is already going on?
>>>>>>
>>>>> Every time a backfill finishes, a new one will be placed in the queue.
>>>>> The number of backfills won't reduce as long as you don't lower it. You
>>>>> can adjust it and see if it improves the backfill process or not (wait
>>>>> an hour or two).
>>>>>
>>>>>> Output of commands in attachment.
>>>>>>
>>>>> There seems to be a low amount of PGs for the rgw data pool, compared
>>>>> to the amount of OSDs. Though it depends on the EC profile and the size
>>>>> of a shard (`ceph pg <id> query`) whether this is really an issue. But
>>>>> in general the number of PGs is important, because too few of them will
>>>>> make them grow larger. Hence backfilling a PG will take longer and more
>>>>> easily tilts the usage of OSDs, as the algorithm places PGs
>>>>> pseudo-randomly and does not take their size into account.
>>>>>
>>>>> I'd wait with the PG adjustment until the backfilling to the HDDs has
>>>>> finished, should you need to adjust the number of PGs, as this will
>>>>> create more data movement.
>>>>>
>>>>> Cheers,
>>>>> Alwin
>>>>> croit GmbH, https://croit.io/
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx