Re: Advice on enabling autoscaler

Hi Dan,

--

Hi,

OK, you don't need to set 'warn' mode -- the autoscale status already has the info we need.

One more thing -- how many PGs do you have per OSD right now for the nvme and hdd roots?
Can you share the output of `ceph osd df tree` ?

>> That is 1347 lines of text, are you sure you want all of it? :-) In summary: for HDD we have between 7 and 55 PGs per OSD, with OSD sizes ranging from 10 to 14 TB.
>> For NVMe it is between 30 and 60 PGs per OSD, all 1.4 TB in size; we run 4 OSDs per NVMe device.

Generally, the autoscaler tries to increase your pools so that there are roughly 100 PGs per OSD. That number is a good rule of thumb for trading off OSD memory usage against good data balancing.
However, if your cluster is already adequately balanced (with the upmap balancer) then there might not be much use in splitting these pools further.
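(As rough arithmetic, assuming ~950 HDD OSDs and the 3x replication shown in your pools: a single pool at pg_num 16384 already puts about 16384 * 3 / 950 ≈ 52 PGs on each HDD OSD; it is the sum over all pools sharing a root that should end up near 100.)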

>> We still have a few really old pre-Luminous clients and thus cannot use upmap, only the older balancer mode (crush-compat). Balancing has been done by hand before, but that is getting tedious at best, so we want (need) to use the automatic balancer as well. The idea was to increase the PGs first and auto-balance afterwards.
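>> For reference, the rough plan on our side afterwards would be something like the following (just a sketch, assuming crush-compat is indeed the mode we are stuck with because of the old clients):
>>
>> # ceph balancer mode crush-compat
>> # ceph balancer on
>> # ceph balancer status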

That said -- some of your splits should go quite quickly, e.g. the nvme pool going from 256 to 2048 PGs with only ~4 GB of data.

>> I know; we already did a few splits from 128 to 256 and those went really fast. But is it safe to increase pg_num and pgp_num in one go for these pools?

Some more gentle advice, if you do decide to go ahead with this, would be to take the autoscaler's guidance but make the pg_num changes yourself. (Splitting the pool holding 1128T of data will take several weeks -- you probably want to make the changes gradually, not all at once.)

>> That's the kind of advice we are looking for ;) Would this mean going to 8k first and then to 16k, or even smaller intermediate steps? And how should we increase pgp_num?
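>> Is something along these lines what you mean, one step at a time (pool name just a placeholder)?
>>
>> # ceph osd pool set <pool> pg_num 8192
>> (wait for the new PGs to be created and peered)
>> # ceph osd pool set <pool> pgp_num 8192
>> (wait for backfill to finish and the cluster to return to HEALTH_OK, then repeat towards 16384)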

(You can sense I'm hesitating to recommend you just blindly enable the autoscaler now that you have so much data in the cluster -- I fear it might be disruptive for several weeks at best, and at worst you may hit that pg log OOM bug).

>> But wouldn't this bug also hit with a manual increase?

Cheers, Dan

> On 02/07/2022 1:15 PM Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:
>
>
> Hi Dan,
>
> Here's the output. I removed pool names on purpose.
>
> SIZE TARGET SIZE RATE RAW CAPACITY  RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
> 19       100.0T  3.0       11098T 0.0270                               1.0    256            off
> 104.5G   1024G   3.0       558.9T 0.0054                               1.0    256            off
> 2152M    1024G   3.0       11098T 0.0003                               1.0    256            off
> 4276M    35840G  3.0       558.9T 0.1879                               1.0    256       2048 off
> 19       100.0G  3.0       11098T 0.0000                               1.0    256            off
> 274.2M   100.0T  3.0       11098T 0.0270                               1.0    512            off
> 19       100.0G  3.0       11098T 0.0000                               1.0    256            off
> 308.5G   1024G   3.0       558.9T 0.0054                               1.0    256            off
> 3203G    35840G  3.0       558.9T 0.1879                               1.0   1024            off
> 0        100.0G  3.0       558.9T 0.0005                               1.0    256            off
> 186.8T   600.0T  3.0       11098T 0.1622                               1.0    512       4096 off
> 1820G    35840G  3.0       558.9T 0.1879                               1.0    512       2048 off
> 11465M   51200M  3.0       558.9T 0.0003                               1.0    256            off
> 109.2T   874.0T  3.0       11098T 0.2363                               1.0    512       8192 off
> 2032M    100.0G  3.0       558.9T 0.0005                               1.0    256            off
> 8494G    35840G  3.0       558.9T 0.1879                               1.0    512       2048 off
> 814.9G   10240G  3.0       11098T 0.0027                               1.0    512            off
> 0        5120G   3.0       558.9T 0.0268                               1.0    256            off
> 19       100.0G  3.0       11098T 0.0000                               1.0    256            off
> 186.5G   1024G   3.0       558.9T 0.0054                               1.0    256            off
> 311.0M   10240M  3.0       558.9T 0.0001                               1.0    256            off
> 30011M   100.0G  3.0       558.9T 0.0005                               1.0    512            off
> 77846M   250.0G  3.0       558.9T 0.0013                               1.0    256            off
> 1076M    100.0T  3.0       11098T 0.0270                               1.0    256            off
> 3585G    10240G  3.0       558.9T 0.0537                               1.0    256            off
> 1128T    1716T   3.0       11098T 0.4639                               1.0   4096      16384 off
> 1641G    10240G  3.0       558.9T 0.0537                               1.0    256            off
> 4877M    100.0G  3.0       11098T 0.0000                               1.0    512            off
>
> We will certainly go to warn mode first before enabling it, but I wasn't aware that warn mode tells you more about what it is going to do. Currently we just look at the output above.
> Before taking any steps I was wondering what the best course of action is. As only a few pools are affected, doing a manual increase would be an option for me as well, if that is what you recommend.
>
> As you can see, one pool is basically lacking PGs, while the others are mostly increasing due to the target size being much higher than the current usage.
>
> ________________________________________
> From: Dan van der Ster <daniel.vanderster@xxxxxxx>
> Sent: Monday, 7 February 2022 12:53
> To: Maarten van Ingen; ceph-users
> Subject: Re:  Advice on enabling autoscaler
>
> Dear Maarten,
>
> For a cluster that size, I would not immediately enable the autoscaler, but first enable it in "warn" mode to sanity check what it would plan to do:
>
> # ceph osd pool set <pool> pg_autoscale_mode warn
>
> Please share the output of "ceph osd pool autoscale-status" so we can help guide what you do next.
>
> Also, you should be aware that there are some rare but unpleasant bugs that may be related to PG splitting (autoscaling). See https://tracker.ceph.com/issues/53729
> You may want to wait until that issue is resolved before permanently enabling the autoscaler.
>
> Best Regards,
>
> Dan
>
>
> > On 02/07/2022 12:31 PM Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:
> >
> >
> > Hi,
> >
> > We are about to enable the PG autoscaler on Ceph. Currently we are running the latest minor release of Nautilus with BlueStore and LVM. The current status is that the autoscaler is turned off on all pools and the module is enabled.
> >
> > To make sure we do not kill anything, performance- and/or data-wise, I'd like some advice on how to proceed.
> >
> > We have about 11 PiB of raw HDD storage (roughly 40% of it in use) and about 550 TiB of NVMe storage. In total we have about 1250 OSDs, of which about 300 are NVMe-only. We have CRUSH rules that allow for NVMe-only and HDD-only storage pools.
> >
> > For every pool we have set a target size to guide the autoscaler a bit, and we have also set a minimum of 256 PGs per pool.
> > What now happens is that a few pools would have their number of PGs changed by factors ranging from 4x to 16x. We have never changed the number of PGs in a pool by such factors (no more than 2x in a single go), and never with this much data, so we have no clear idea of what will happen when we enable the autoscaler.
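> > (Concretely, those were set per pool with something like `ceph osd pool set <pool> target_size_bytes <bytes>` and `ceph osd pool set <pool> pg_num_min 256`.)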
> >
> > For example, one pool with about 1 PiB of user data would grow from 4k PGs to 16k PGs, which of course will involve a lot of data movement. Another pool with 100 TiB of data would grow from 512 to 8k PGs.
> >
> > All pools are set with a size of 3, so the abovementioned 1 PiB is 3 PiB of raw data; we currently have no erasure-coded pools.
> >
> > Can somebody help me out with a safe way to enable the autoscaler, or tell me if it's OK to just enable it? We will enable it per pool to limit the number of affected pools.
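> > (That is, per pool via something like `ceph osd pool set <pool> pg_autoscale_mode on`, rather than flipping a global default.)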
> >
> > Kind Regards,
> > Maarten van Ingen
> >
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



