For default.rgw.buckets.index we use SSDs (actually NVMe drives), and for
default.rgw.buckets.data HDDs. The average PG count per OSD (the PGS column
in `ceph osd df`) is around 17.7.

Most of the disks are 6 TB Seagate Enterprise Capacity drives
<https://www.amazon.com/Seagate-Enterprise-Capacity-ST6000NM0095-7200RPM/dp/B01CG0DBXE>,
which "translates" to 5.5T in Ceph, and yes, they are pretty old (from
2016-2017 up to 2021).

Rok

On Mon, Dec 23, 2024 at 4:41 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> [root@ctplmon1 ~]# ceph osd dump | grep pool
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 320144 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
> pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/18964/18962 flags hashpspool stripe_width 0 application rgw
> pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/127672/127670 flags hashpspool stripe_width 0 application rgw
> pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/59850/59848 flags hashpspool stripe_width 0 application rgw
> pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 320144 lfor 0/51538/51536 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
> pool 6 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 315285 lfor 0/127830/127828 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
> pool 7 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor 0/76474/76472 flags hashpspool stripe_width 0 application rgw
> pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5 min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 320144 lfor 0/127784/214408 flags hashpspool,ec_overwrites stripe_width 12288 application rgw
> pool 10 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 320144 flags hashpspool,bulk stripe_width 0 application cephfs
> pool 11 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 320144 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
>
> ---
>
> Are you using HDDs, SSDs, or both? What does the PGs column at the right
> end of `ceph osd df` average? I'm still spinning up my brain this morning,
> but this seems reeeeeally low, like ~17 if all the OSDs are the same
> device class.
>
> buckets.index, notably, should be way higher. Assuming that your OSDs are
> all identical and thus that the index pool spans them all, I'd increase
> pg_num for the index pool and cephfs_metadata to 256 and for buckets.data
> to maybe 2048.
>
> Right now there are around 200 osds (5.5T) in a cluster, with around 25
> waiting to be added.
>
> 5.5T seems like an unusual number. Are these old HDDs, or perhaps 3DWPD
> SSDs?
>
> Rok
>
> On Mon, Dec 23, 2024 at 4:16 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
>> > autoscale_mode for pg is on for that particular pool
>> > (default.rgw.buckets.data) and EC 3-2 is used. During the pool's lifetime
>> > I've seen the PG number change automatically once,
>>
>> pg_num for a given pool likes to be a power of 2, so either the relative
>> usage of pools or the overall cluster fillage has to change substantially
>> for a change to be triggered in many cases.
>>
>> > but now I am also considering changing the PG number manually after the
>> > backfill completes.
>>
>> If you do, be sure to disable the autoscaler for that pool.
>>
>> > Right now pg_num 512 pgp_num 512 is used and I am considering changing
>> > it to 1024. Do you think that would be too aggressive maybe?
>>
>> Depends on how many OSDs you have and what the rest of the pools are
>> like. Send us
>>
>> `ceph osd dump | grep pool`
>>
>> These days, assuming that your OSDs are BlueStore, chances are that going
>> higher on pg_num won't cause issues.
>>
>> > Rok
>> >
>> > On Sun, Dec 22, 2024 at 8:46 PM Alwin Antreich <alwin.antreich@xxxxxxxx>
>> > wrote:
>> >
>> >> Hi Rok,
>> >>
>> >> On Sun, 22 Dec 2024 at 20:19, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>> >>
>> >>> First I tried with osd reweight, waited a few hours, then osd crush
>> >>> reweight, then with pg-upmap from Laimis. Crush reweight seems to have
>> >>> been the most effective, but not for "all" OSDs I tried.
>> >>>
>> >>> Uh, I probably set ceph config set osd osd_max_backfills to a high
>> >>> number in the past; is it better to reduce it to 1 in steps, since a
>> >>> lot of backfilling is already going on?
>> >>>
>> >> Every time a backfill finishes, a new one will be placed in the queue.
>> >> The number of backfills won't reduce as long as you don't lower it. You
>> >> can adjust it and see if it improves the backfill process or not (wait
>> >> an hour or two).
>> >>
>> >>> Output of commands in attachment.
>> >>>
>> >> There seems to be a low number of PGs for the RGW data pool compared to
>> >> the number of OSDs. Though it depends on the EC profile and the size of
>> >> a shard (`ceph pg <id> query`) whether this is really an issue. But in
>> >> general the number of PGs is important, because too few of them will
>> >> make them grow larger. Hence backfilling a PG will take longer and more
>> >> easily tilts the usage of OSDs, as the algorithm places PGs
>> >> pseudo-randomly and does not take their size into account.
>> >>
>> >> I'd wait with the PG adjustment until the backfilling to the HDDs has
>> >> finished, should you need to adjust the number of PGs, as this will
>> >> create more data movement.
>> >>
>> >> Cheers,
>> >> Alwin
>> >> croit GmbH, https://croit.io/
>> >>
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
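
[Editor's note: a minimal sketch of the adjustments discussed in the thread,
assuming the pool names from the `ceph osd dump` output above. The pg_num
targets are Anthony's suggestions (he floated up to 2048 for buckets.data;
Rok was considering 1024) and the backfill throttle follows Alwin's advice,
so validate both against the actual OSD count and device classes before
applying anything.]

    # Step the raised osd_max_backfills back down, as discussed.
    ceph config set osd osd_max_backfills 1

    # Disable the autoscaler on pools that will be resized manually,
    # so it does not fight the manual pg_num change.
    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
    ceph osd pool set cephfs_metadata pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off

    # Raise pg_num to power-of-two targets once the current backfill has
    # finished; on Nautilus and later, pgp_num follows pg_num automatically.
    ceph osd pool set default.rgw.buckets.index pg_num 256
    ceph osd pool set cephfs_metadata pg_num 256
    ceph osd pool set default.rgw.buckets.data pg_num 1024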