Re: Advice on enabling autoscaler

> On 02/07/2022 1:51 PM Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:
> One more thing -- how many PGs do you have per OSD right now for the nvme and hdd roots?
> Can you share the output of `ceph osd df tree` ?
> 
> >> This is only 1347 lines of text, are you sure you want that? :-) In summary, for HDD we have between 7 and 55 PGs per OSD; OSD sizes range from 10 to 14 TB.
> >> NVMe is between 30 and 60, all 1.4 TB; we run 4 OSDs per NVMe.

I see. A pastebin would make that readable. Share it privately if you prefer.

> Generally, the autoscaler is trying to increase your pools so that there are roughly 100 PGs per OSD. This number is a good rule of thumb to balance memory usage of the OSD and balancing of the data.
> However, if your cluster is already adequately balanced (with the upmap balancer) then there might not be much use in splitting these pools further.
> 
> >> We still have a few really old, pre-Luminous clients and thus cannot use the upmap balancer, only the older one. Balancing has been done by hand before, but that is getting tedious at best, so we want (need) to use the auto-balancer as well. The idea was to increase PGs first and auto-balance afterwards.

The other balancers are not very good. Is it really not an option to upgrade those old clients so you can enable the upmap balancer? It should do a good job even before you split the pools.
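
For reference, once the old clients are gone, enabling it is roughly the following (a sketch, not a full procedure; the first step will refuse to run while pre-Luminous clients are still connected):

    # require Luminous-or-newer clients so upmap entries can be used
    ceph osd set-require-min-compat-client luminous
    # switch the balancer to upmap mode and turn it on
    ceph balancer mode upmap
    ceph balancer on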

> That said -- some of your splits should go quite quickly, e.g. the nvme 256 -> 2048 having only 4GB of data.
> 
> >> We already did a few splits from 128 to 256 and those were really fast. But is it safe to increase pg_num and pgp_num in one go for these pools?

I think the CLI will only let you go 2x or maybe 4x in one go.
Set pg_num, wait for all the new PGs to be created, then watch `ceph osd pool ls detail`: it shows pg_num, pgp_num, pg_num_target and pgp_num_target, which together show the splitting progress.
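
For example (the pool name and the target pg_num here are just placeholders):

    # request the split; the mons may cap how big a jump is accepted
    ceph osd pool set <pool> pg_num 512
    # watch pg_num/pgp_num converge toward pg_num_target/pgp_num_target
    ceph osd pool ls detail | grep <pool>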

> Some more gentle advice, if you do decide to go ahead with this, would be to take the autoscaler's guidance and make the pg_num changes yourself. (Splitting your pool with 1128 TB of data will take several weeks; you probably want to make the changes gradually, not all at once.)
> 
> >> That's the kind of advice we are looking for ;) Would this mean going to 8k first and then 16k, or even smaller intermediate steps? And how do we increase pgp_num?

By gentle I mean a few hundred at a time, or even just 10 at first, to see how long the first few PG splits take.

pgp_num should move automatically when you set pg_num. (See the _target values I mentioned earlier.)
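
Concretely, a manual, gradual approach could look something like this (the pool name and pg_num values are only illustrative; start from your pool's actual pg_num, and autoscale-status needs the pg_autoscaler mgr module enabled):

    # see what the autoscaler would recommend, without letting it act
    ceph osd pool autoscale-status
    # then bump pg_num yourself in small steps
    ceph osd pool set <bigpool> pg_num 4106   # e.g. +10 first, to gauge split speed
    ceph osd pool set <bigpool> pg_num 4352   # then a few hundred at a time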

> (You can sense I'm hesitating to recommend you just blindly enable the autoscaler now that you have so much data in the cluster -- I fear it might be disruptive for several weeks at best, and at worst you may hit that pg log OOM bug).
> 
> >> But this bug would not hit with a manual increase?

No, it could still hit even with a manual increase.

That said, we split a huge pool from 4096 to 8192 PGs sometime last year. It triggered a few bugs but no disasters. (It took a few weeks.)

-- dan