I decided to start a new thread to discuss this tool. I added some comments and removed a few things specific to our environment (like needing to run ceph with sudo, because our ceph config files are readable only by root).
To answer Tomasz's question: we have our down_out interval set very high so that when an OSD goes down, we go in and generate a new map before we remove the drive, so it only backfills once. That moves much less data, because you don't backfill once when the OSD goes out and then again to balance the cluster. In general this backfill is about the same size as the backfill that would have happened automatically when the OSD went out.
In its current incarnation...
1) This script can balance a cluster where one pool holds the vast majority of the data (the pool name is hardcoded to rbd, but easily changed).
2) It assumes that all of your drives are larger than 1000 GB, because of how it calculates how many PGs you should have per TB.
3) It works by adjusting weights in the CRUSH map until all OSDs are within 2 PGs of each other for the primary data pool.
4) The --offset option is pivotal to balancing the map. Test this setting up and down until you get the best set of OSDs being weighted up and down. Some of our clusters like 4, others like 0, most like 2. I think it depends on how many PGs you have in other pools, but this variable allows for variation between clusters.
5) Running this script makes zero modifications to your cluster. Its purpose is to generate a CRUSH map for you to test with crushtool and then upload to your cluster yourself with the necessary flags set.
6) This script assumes that your pg_num is a power of 2. If your pg_num is not a power of 2, then some of your PGs are twice as big as others, and balancing by how many PGs an OSD has will leave the cluster imbalanced.
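Point 3's loop can be sketched roughly like this. The names, the 0.01 step size, and the "one PG moves per step" shortcut are my assumptions; the real script would recompute placements with crushtool rather than simulating them:

```python
# Sketch: nudge CRUSH weights until all OSDs are within 2 PGs of each
# other for the primary pool. pg_counts maps osd id -> PG count and
# weights maps osd id -> CRUSH weight; both would come from the cluster.

def balanced(pg_counts, spread=2):
    return max(pg_counts.values()) - min(pg_counts.values()) <= spread

def rebalance(pg_counts, weights, step=0.01, spread=2):
    """Return adjusted weights; simulates one PG migrating per step."""
    pg_counts = dict(pg_counts)
    weights = dict(weights)
    while not balanced(pg_counts, spread):
        hi = max(pg_counts, key=pg_counts.get)  # OSD with the most PGs
        lo = min(pg_counts, key=pg_counts.get)  # OSD with the fewest
        weights[hi] -= step
        weights[lo] += step
        # The real tool feeds the new map to crushtool and re-reads the
        # placements; here we pretend one PG moved from hi to lo.
        pg_counts[hi] -= 1
        pg_counts[lo] += 1
    return weights
```

The total weight stays constant, which is why this converges instead of drifting the whole cluster up or down.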
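For point 4, the exact semantics of --offset aren't spelled out above, so here is one purely hypothetical reading of it: shift the target PG count before deciding which OSDs get weighted down versus up, which changes which OSDs land in each set:

```python
# Hypothetical illustration of an --offset style knob (the real script's
# use of --offset may differ): shift the per-OSD target PG count, then
# split OSDs into weight-down and weight-up groups around it.

def split_osds(pg_counts, offset=2):
    """Return (weight_down, weight_up) OSD id lists around the target."""
    target = sum(pg_counts.values()) / len(pg_counts) + offset
    down = [osd for osd, n in pg_counts.items() if n > target]
    up = [osd for osd, n in pg_counts.items() if n < target]
    return down, up
```

Under this reading, raising the offset shrinks the weight-down set, which would explain why different clusters prefer different values.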
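Point 6's size skew is easy to check for. With a non-power-of-2 pg_num, some PGs have been split one more time than others, so the unsplit ones hold roughly twice the data; for example pg_num = 12 leaves 4 double-size PGs:

```python
# Quick checks before trusting raw PG counts for balancing.

def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0

def double_size_pgs(pg_num):
    """Number of PGs holding ~2x data when pg_num isn't a power of 2."""
    if is_power_of_two(pg_num):
        return 0
    low = 1 << (pg_num.bit_length() - 1)  # previous power of two
    return 2 * low - pg_num               # PGs not yet split
```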
The idea/theory for making this work on a cluster with multiple pools sharing the data is to calculate how much a PG in each pool is worth (based on the percentage of the data in each pool) and sum the weighted values of the PGs on each OSD to know whether it needs to gain or lose PGs.
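A minimal sketch of that weighting, assuming per-pool byte totals and per-OSD PG counts as inputs (the names are mine; the real numbers would come from `ceph df` and a pg dump):

```python
# Weight each PG by its pool's average bytes-per-PG, then score each
# OSD by the sum of its weighted PGs. OSDs with high scores need to
# lose PGs; OSDs with low scores need to gain them.

def pg_value(pool_bytes, pg_num):
    """Average bytes per PG for one pool."""
    return pool_bytes / pg_num

def osd_scores(osd_pgs, pools):
    """osd_pgs: osd -> {pool: pg_count}; pools: pool -> (bytes, pg_num)."""
    return {
        osd: sum(count * pg_value(*pools[pool]) for pool, count in counts.items())
        for osd, counts in osd_pgs.items()
    }
```

This is why a small cache pool barely shifts the scores while the dominant data pool drives them.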
I have generated maps with a modified version of this tool for a cluster with a data pool and a cache pool on separate disks in separate roots, and it worked quite well. The modifications balanced each pool one at a time, with the replica sizes, pool numbers, osd tree, and osd df output hardcoded from values supplied by the cluster's owner.
Let me know what you think. I know that this has worked extremely well for my co-workers and myself, but we have very limited variety in our setups.
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com