On 2020-08-19 04:05, norman wrote:
> Hans,
>
> I made a big change in my staging cluster before, I set a pool pg_num
> from 8 to 2048, it caused the cluster to be unavailable for a long time :(

I doubt it will have that big an effect this time. The change from
8 -> 2048 (a 256-fold increase) is way bigger than "just" a doubling to
4096. It does of course depend on how much data has been added to the
pool in the meantime, but if each PG holds less data, the impact is
probably lower.

It's always a trade-off between recovery speed and decent client IO
performance, and it depends heavily on the use case, cluster size, and
hardware specifications such as disk type, amount of RAM / CPU, and
networking. For some clusters (if you have hundreds of nodes) the values
you have used might be no big deal, but for small(er) clusters they can
have a big impact. Then again, if you have NVMe and loads of memory and
CPU, you might even get away with it.

So it's best to start with the low(est) possible recovery/backfill
settings and slowly scale up. Having metrics of your cluster and client
VMs (perceived latency on the VMs, for example) will be crucial to put a
number on this. Having a baseline of the cluster's performance will help
you decide when things get out of hand. It will also help to identify
when the cluster is least busy, so you can use that time window to
perform maintenance (i.e. not when backups are active, which tend to
stress clusters a lot).

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
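
To make the "start low and scale up slowly" advice above a bit more concrete, here is a minimal Python sketch. It only builds the command strings and does not run anything against a cluster. The pool name "mypool", the helper names, and the choice to double pg_num per step are assumptions made for illustration, not something from this thread; the throttle values of 1 are simply the lowest settings, not a recommendation for any particular cluster.

```python
from typing import List


def pg_num_steps(current: int, target: int) -> List[int]:
    """Return the pg_num values to pass through, doubling at each step."""
    if current <= 0 or target < current:
        raise ValueError("expected 0 < current <= target")
    steps = []
    value = current
    while value < target:
        # Never overshoot the target on the last step.
        value = min(value * 2, target)
        steps.append(value)
    return steps


def plan_commands(pool: str, current: int, target: int) -> List[str]:
    """Build (but do not execute) ceph CLI commands for a cautious split."""
    commands = [
        # Assumption: begin with the lowest backfill/recovery throttles and
        # raise them later only if client latency stays acceptable.
        "ceph config set osd osd_max_backfills 1",
        "ceph config set osd osd_recovery_max_active 1",
    ]
    for pg_num in pg_num_steps(current, target):
        commands.append(f"ceph osd pool set {pool} pg_num {pg_num}")
    return commands


if __name__ == "__main__":
    # Compare the size of the two changes discussed in the thread.
    print("8 -> 2048 is a factor of", 2048 // 8)
    print("2048 -> 4096 is a factor of", 4096 // 2048)
    # "mypool" is a hypothetical pool name.
    for command in plan_commands("mypool", 2048, 4096):
        print(command)
```

In practice you would wait for the cluster to settle (for example by checking `ceph -s`) and compare client latency against your baseline before issuing the next pg_num step or raising the throttles.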