hi,

I wanted to report back that splitting worked *exactly* as you described,
by running "ceph osd pool set default.rgw.buckets.data pg_num 32".

The whole process of splitting the placement groups from 8 to 32 and
re-peering them took approximately 2 minutes, on 10 OSDs across 5 hosts.
I had an OSD crash during that time, but Ceph handled it gracefully.
Downtime was minimal.

I set target_max_misplaced_ratio to 3%, but the misplaced objects were
around 9% (2 active backfills and 2 waiting), which probably has to do
with the fact that each OSD has too many objects.
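In case it is useful to others, this is roughly the sequence I ran. It
is a sketch from memory rather than a copy-paste log, and the 3% target
was just my choice:

    # lower the target misplaced ratio from the default 5% to 3%
    ceph config set mgr target_max_misplaced_ratio 0.03

    # split the PGs; the mgr then raises pgp_num on its own
    ceph osd pool set default.rgw.buckets.data pg_num 32

    # watch pgp_num catch up and the backfills drain
    ceph osd pool get default.rgw.buckets.data pgp_num
    ceph -s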
thank you

On Tue, Aug 3, 2021 at 4:51 PM 胡 玮文 <huww98@xxxxxxxxxxx> wrote:

> On Aug 3, 2021, at 21:32, Gabriel Tzagkarakis <gabrieltz@xxxxxxxxx> wrote:
>
> hi, thank you for replying.
>
> Does this method refer to manually setting the number of placement groups
> while keeping the autoscale_mode setting off?
> Also, from what I can see in the documentation,
> target_max_misplaced_ratio implies using the balancer feature, which
> I am currently not using.
>
> I believe this "auto pgp_num increasing" feature works independently of
> the autoscaler and the balancer. The last time I increased pg_num to
> 1024, I had autoscale_mode set to warn and the balancer off. I recommend
> you read this blog post:
> https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/
> Specifically, the part near "Starting in Nautilus, this second step is no
> longer necessary: ...".
>
> And target_max_misplaced_ratio is not only used by the balancer; it is
> also used by this feature.
>
> If I understood correctly, the existing PGs will be split in place and
> act as primaries for the backfills that will be required to distribute
> the data evenly to all OSDs.
>
> Can I use the manual way to slowly increase pgp_num on the pool, and then
> enable the balancer once my PGs have a more manageable size?
>
> Will there be a considerable amount of downtime while splitting PGs and
> peering?
>
> I didn't observe any significant downtime the last time I did this. I
> think it is several seconds at most.
>
> I'm sorry for asking too many questions, I'm trying not to break stuff :)
>
> On Tue, Aug 3, 2021 at 3:46 PM 胡 玮文 <huww98@xxxxxxxxxxx> wrote:
>
>> Hi,
>>
>> Each placement group will get split into 4 pieces in place, all at
>> nearly the same time; no empty PGs will be created.
>>
>> Normally, you only set pg_num and do not touch pgp_num. Instead, you
>> can set "target_max_misplaced_ratio" (default 5%). Then the mgr will
>> increase pgp_num for you. It will raise pgp_num so that some PGs get
>> placed onto other OSDs, until the misplaced ratio reaches the target.
>> Then it waits for some backfilling to finish before increasing pgp_num
>> again. (This behavior seems to have been introduced in Nautilus.)
>>
>> So I don't think you need to worry about full OSDs. The "backfillfull
>> ratio" should throttle backfill when an OSD is nearly full, which in
>> turn will throttle the pgp_num increases.
>>
>> From: Gabriel Tzagkarakis <gabrieltz@xxxxxxxxx>
>> Sent: August 3, 2021 19:42
>> To: ceph-users@xxxxxxx
>> Subject: PG scaling questions
>>
>> hello everyone,
>>
>> I would like to know how autoscaling or manual scaling actually works,
>> so that I can keep my cluster from running out of disk space.
>>
>> Let's say I want to scale a pool of 8 PGs, each ~400 GB, to 32 PGs.
>>
>> 1) Does each placement group get split into 4 pieces IN PLACE, all at
>> the same time?
>> 2) Does autoscaling pick one of the existing placement groups at
>> random, for example X.Y, create new empty placement groups, migrate the
>> data onto them, and then continue with the next big PG, with or without
>> deleting the original PG?
>> 3) Something else?
>>
>> I am mostly concerned about the period when both the pre-existing PGs
>> and the newly created ones co-exist in the cluster, because I want to
>> avoid full OSDs. In my case each PG holds many small objects, and
>> deleting stray PGs takes a long time.
>>
>> Would it be better if I used something like
>> ceph osd pool set default.rgw.buckets.data pg_num 32
>> and then increased pgp_num in increments of 8, assuming only one of the
>> original PGs is affected at a time? But my assumption may be wrong
>> again.
>>
>> I could not find anything relevant in the documentation.
>>
>> Thank you
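For the archives: the manual variant I originally asked about would look
roughly like the sketch below (assuming autoscale_mode is off for the
pool). I did not end up needing it, since the mgr raises pgp_num on its
own once pg_num is set:

    # split the PGs in place first
    ceph osd pool set default.rgw.buckets.data pg_num 32

    # then move data gradually, raising pgp_num in increments of 8 and
    # letting backfill settle (check "ceph -s") between steps
    ceph osd pool set default.rgw.buckets.data pgp_num 16
    ceph osd pool set default.rgw.buckets.data pgp_num 24
    ceph osd pool set default.rgw.buckets.data pgp_num 32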