I noticed that bug with the passes as well. I'm glad you got it working.
To modify the script for clusters with multiple pools, it would need to do the opposite of ignoring which pool a PG comes from: it would have to track which pool every PG belongs to and what percentage of the cluster's used data that pool holds.
An option for setting up weights for this would be...
Take a cluster with 3 pools. It's currently 50% full; 30% of that used space is in the vms pool, 55% is in the volumes pool, and 15% is in the images pool. There are 4096 PGs in the vms pool, 8192 in the volumes pool, and 1024 in the images pool. The cluster has 1,000 TB of total raw space.
These numbers are gathered from various places in the script and stored in appropriate variables. From them we can work out that each TB in the cluster should hold roughly 4 PGs from the vms pool, 8 PGs from the volumes pool, and 1 PG from the images pool. Using each pool's percentage of used space as the weight for its PGs (0.3 for PGs from the vms pool, 0.55 for PGs from the volumes pool, and 0.15 for PGs from the images pool) and multiplying those weights by the number of PGs each TB should hold, we get a weight of 5.75 per TB of cluster capacity. An example 4 TB OSD should therefore have a total PG weight of 23.
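For illustration, that arithmetic in shell might look something like the sketch below. The variable names are made up and the numbers are just the example above; the real script would pull them from ceph df and each pool's pg_num.

    # Example figures from above; the percentages are each pool's share of used
    # space, and the per-TB PG counts are pg_num divided by 1,000 TB raw, rounded.
    vms_pct=.30;     vms_pgs_per_tb=4
    volumes_pct=.55; volumes_pgs_per_tb=8
    images_pct=.15;  images_pgs_per_tb=1

    weight_per_tb=$(echo "$vms_pct*$vms_pgs_per_tb + $volumes_pct*$volumes_pgs_per_tb + $images_pct*$images_pgs_per_tb" | bc -l)
    echo "PG weight per TB: $weight_per_tb"                            # 5.75
    echo "Target for a 4 TB OSD: $(echo "$weight_per_tb*4" | bc -l)"   # 23.00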
Now that we know what this OSD should have on it, we run osdmaptool against each pool specifically, note how many PGs the OSD has from each pool, and multiply each count by that pool's weight. So if this 4 TB OSD has 18 PGs from the vms pool, 30 PGs from the volumes pool, and 7 PGs from the images pool, the sum of its weighted PGs is 22.95, which is spot on. However, if another 4 TB OSD has 10 PGs from vms, 24 from volumes, and 20 from images, then the sum of its weighted PGs is 19.2 and it will need to be weighted up.
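Leaving aside how the per-pool counts get scraped out of the osdmaptool output (that format varies by Ceph release, so treat any parsing as an assumption), the weighted sum itself is just the following, again with hypothetical variable names:

    # Per-pool PG counts for one 4 TB OSD, as counted from per-pool osdmaptool runs
    vms_count=18; volumes_count=30; images_count=7

    weighted_pgs=$(echo "$vms_count*.30 + $volumes_count*.55 + $images_count*.15" | bc -l)
    echo "Weighted PG sum: $weighted_pgs"   # 22.95 against a target of 23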
The script in its single-pool version only cares about how many PGs an OSD has in the primary pool and weights it up or down if the total is more than 1 PG away from the average. With this weighting modification, it would instead reweight if the OSD's PG weight were further than X away from the average. X would need to be calculated and tested to see how close to or far from the average you need to be to get a map that can be backfilled onto every time.
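The decision itself could be a sketch as simple as this, assuming $weighted_pgs, $target (the OSD's ideal weighted PG sum), $threshold (the X above), and $osd have already been set:

    diff=$(echo "$weighted_pgs - $target" | bc -l)
    if (( $(echo "$diff > $threshold" | bc -l) )); then
        echo "osd.$osd is over its weighted PG target by $diff -> weight it down"
    elif (( $(echo "$diff < -$threshold" | bc -l) )); then
        echo "osd.$osd is under its weighted PG target by $diff -> weight it up"
    fi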
As a side note, this script should exit immediately if any pool in the cluster does not have a power-of-2 number of PGs.
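A guard for that could look like the sketch below, which assumes the pool lines of ceph osd dump contain "pg_num <N>" and that GNU grep (for -P) is available; check the output format on your release.

    # Abort if any pool's pg_num is not a power of two
    # (a power of two shares no bits with itself minus one).
    for pg_num in $(ceph osd dump | grep '^pool ' | grep -oP 'pg_num \K[0-9]+'); do
        if (( pg_num & (pg_num - 1) )); then
            echo "Found a pool with pg_num $pg_num, which is not a power of 2; exiting." >&2
            exit 1
        fi
    done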
David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943
If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.
________________________________________
From: Paweł Sadowski [ceph@xxxxxxxxx]
Sent: Tuesday, January 10, 2017 2:38 AM
To: David Turner
Cc: ceph-large@xxxxxxxxxxxxxx
Subject: Re: Crush Offline Reweighting tool
Hi,
I've been testing your script for a while. For a single pool it works
nicely, but most of our clusters use multiple pools (vms, volumes, images,
etc.). I wrote a simpler version in Python (it only checks PG counts, not
disk space) that works on all pools (without specifying a pool ID to
osdmaptool), but the results are bad. Sometimes the cluster is unable to
rebalance after applying such a map, and the PG distribution from each pool
is not even (some hosts might not get any PGs from some pools). Have you
created/tested a 'weighted' version of this tool?
Attached are the results from applying a 'balanced' crushmap on a test
cluster with a single pool, along with my (dirty) Python script.
There is a little bug in your script preventing it from stopping after the
specified number of passes:
--- offline_reweight_tool.sh.orig 2017-01-10 09:17:54.735290710 +0000
+++ offline_reweight_tool.sh 2017-01-04 08:47:47.065105637 +0000
@@ -107,7 +107,7 @@
             variable=
             ;;
         passes)
-            if [ $passes -gt 0 ]
+            if [ $i -gt 0 ]
             then
                 passes=$i
             else
Regards,
PS
On 11/23/2016 06:38 PM, David Turner wrote:
> I decided to start a new thread to discuss this tool. I added in some
> comments and removed a few things specific to our environment (like
> needing to run ceph as sudo because we have our ceph config files
> readable only by root).
>
> To answer Tomasz's question: we have our down_out interval set really
> high, so that when an OSD goes down we go in and generate a new map
> before we remove the drive, and it only backfills once. That way it
> moves much less data, because you don't backfill when the OSD goes out
> and then again to balance the cluster. Generally this backfilling is
> about the same as the backfill that would have happened automatically
> when the OSD went out.
>
>
> In its current incarnation...
>
> 1) This script is capable of balancing a cluster with 1 pool that has
> a vast majority of all of the data (hardcoded to rbd, but easily
> changeable)
> 2) It assumes that all of your drives are larger than 1000GB, because of
> how it calculates how many pgs you should have per TB.
> 3) It works by changing weights on the crush map until all osds are
> within 2 pgs of each other for the primary data pool.
> 4) The --offset option is pivotal to balancing the map. Test this
> setting going up and down until you have the best set of osds being
> weighted up and down. Some of our clusters like a 4, others like 0,
> most like 2. I think it has to do with how many pgs you have in other
> pools, but this variable allows for variations between clusters.
> 5) Running this script will make zero modifications to your cluster.
> Its purpose is to generate a crush map for you to test with
> crushtool and to upload to your cluster with the necessary flags set.
> 6) This script assumes that your pg_num is a power of 2. If your
> pg_num is not a power of 2, then some of your pgs are twice as big as
> other pgs and balancing by how many pgs an osd has will result in an
> imbalanced cluster.
>
>
> The idea/theory for making this work for a cluster with multiple pools
> sharing the data is to calculate how much a pg for each pool is worth
> (based on the % of data in each pool) and sum the weighted values of
> each pg that an osd has to know if it needs to gain or lose pgs.
>
> I have generated maps using a modified version of this tool for a
> cluster with a data and cache pool using separate disks in separate
> roots which worked quite well. The modifications were to balance each
> pool one at a time with hardcoded output supplied from the owner of
> the cluster for the replica sizes, pool number, osd tree, and osd df.
>
> Let me know what you think. I know that this has worked extremely
> well for my co-workers and myself, but we have very limited variety in
> our setups.