Re: Crush Offline Reweighting tool

David Turner <david.turner@xxxxxxxxxxxxxxxx> · Wed, 18 Jan 2017 19:05:01 +0000

I just realized I'm using the wrong number to calculate what percentage of the data each pool holds.  I'll modify the script to use the better number later.










David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943










If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.









From: Paweł Sadowski [ceph@xxxxxxxxx]

Sent: Friday, January 13, 2017 12:12 PM

To: David Turner

Cc: ceph-large@xxxxxxxxxxxxxx

Subject: Re: [Ceph-large] Crush Offline Reweighting tool






Hi,
I'll definitely try it on Monday. In the meantime I've updated my script with some initial support for multiple pools, based on the previous one and your ideas. I've tested it on some clusters and it gives pretty good results (disk space within a 3% range). But I noticed that on clusters where rack weights differ a lot, the distribution of primary OSDs is uneven. Have you noticed similar behavior? (You can check it in the osdmaptool output.) For such clusters I also had to increase the choose_total_tries tunable to allow the cluster to finish rebalancing (as in your example, PGs were stuck in the active+remapped state). This mostly applies when we grow a cluster of, for example, three racks by adding some new OSDs in a fourth one (or n-th one) and we don't want to, or can't, fill the full rack at once (so it has a much smaller weight than the other three). By the way, we use a failure domain of rack with replica size 3.
Have a nice weekend,
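
One way to look at the primary distribution mentioned above is to count how often each OSD appears first in a PG's OSD list, since the first OSD in the set acts as the primary for replicated pools. A minimal Python sketch, assuming the PG-to-OSD mappings have already been parsed into a dict; the parsing step, the function name, and the sample data are hypothetical and not taken from either script:

```python
from collections import Counter


def count_primaries(pg_to_osds):
    """Count how many PGs each OSD is primary for.

    pg_to_osds maps a PG id such as "1.0" to its ordered OSD list;
    the first entry is treated as the primary.
    """
    primaries = Counter()
    for osds in pg_to_osds.values():
        if osds:  # skip PGs that currently map to no OSD
            primaries[osds[0]] += 1
    return primaries


# Hypothetical sample data: OSD 3 is primary for three of the four PGs.
sample_primary_map = {
    "1.0": [3, 1, 2],
    "1.1": [3, 0, 2],
    "1.2": [1, 3, 0],
    "1.3": [3, 2, 1],
}
primary_counts = count_primaries(sample_primary_map)
```

Comparing these counts against each OSD's weight would show whether the skew follows the rack weights.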


On 13.01.2017 19:36, David Turner wrote:



I have a beta version that should balance a map for a cluster with multiple pools of varying amounts of data.  The logic is roughly what I came up with on Tuesday.



I don't have a good dev setup to test it, but I know that it's generating maps that look pretty good.  Would you be able to try some testing with this new version, Pawel Sadowski?



There was a thread on the ceph-users ML recently where someone had a map I generated for them stuck with 4 active+remapped PGs.  In their case, they didn't understand Ceph at all and kept adding OSDs without regard for balanced placement (3 nodes with 4TB, 1 node with 32TB, and another with 16TB, with replica size 3...).  In any case, setting --set-choose-total-tries to 100 let the cluster finish backfilling those last 4 PGs.  I mention it because, although it isn't likely to happen to anyone who understands how to design a cluster, it could come up.
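
For reference, the tunable can be changed offline on a compiled CRUSH map with crushtool before the new map is injected. The Python sketch below only builds the command line and does not run anything; the function name and the file names are placeholders, and 100 is simply the value from the case described above:

```python
def set_total_tries_cmd(in_map, out_map, tries=100):
    """Build a crushtool command that rewrites choose_total_tries.

    in_map and out_map are paths to compiled CRUSH maps (placeholders here).
    """
    if tries <= 0:
        raise ValueError("tries must be a positive integer")
    return [
        "crushtool",
        "-i", in_map,
        "--set-choose-total-tries", str(tries),
        "-o", out_map,
    ]


cmd = set_total_tries_cmd("crush.map", "crush.new")
```

On a host with crushtool installed, the list could be passed to `subprocess.run(cmd, check=True)`; the output map still needs to be tested before it is applied to a cluster.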













________________________________________

From: Paweł Sadowski [ceph@xxxxxxxxx]

Sent: Tuesday, January 10, 2017 2:38 AM

To: David Turner

Cc: ceph-large@xxxxxxxxxxxxxx

Subject: Re: [Ceph-large] Crush Offline Reweighting tool



Hi,





I've been testing your script for a while. For a single pool it works
nicely, but most of our clusters use multiple pools (vms, volumes, images,
etc.). I wrote a simpler version in Python (it only checks PG counts, not
disk space) that works on all pools at once (without specifying a pool ID
in osdmaptool), but the results are bad. Sometimes the cluster is unable
to balance after applying such a map. Also, the PG distribution from each
pool is not even (some hosts might not have any PGs from some pools).
Have you created/tested a 'weighted' version of this tool?
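
The per-pool unevenness described above can be checked by counting PGs per pool and per OSD, then listing the OSDs that hold nothing from a given pool. A minimal Python sketch, assuming the mappings are already available as a dict keyed by PG id (the pool id is the part before the dot); the function names and sample data are hypothetical and are not the attached script:

```python
from collections import defaultdict


def pgs_per_pool_and_osd(pg_to_osds):
    """Return counts[pool][osd] = number of PGs of that pool on that OSD."""
    counts = defaultdict(lambda: defaultdict(int))
    for pgid, osds in pg_to_osds.items():
        pool = pgid.split(".", 1)[0]  # "2.1f" -> pool "2"
        for osd in osds:
            counts[pool][osd] += 1
    return counts


def osds_missing_pool(counts, all_osds):
    """For each pool, list the OSDs that hold no PG from it."""
    return {
        pool: sorted(set(all_osds) - set(per_osd))
        for pool, per_osd in counts.items()
    }


# Hypothetical sample: OSD 2 holds nothing from pool 2.
sample_pool_map = {"1.0": [0, 1], "1.1": [1, 2], "2.0": [0, 1]}
pool_counts = pgs_per_pool_and_osd(sample_pool_map)
missing = osds_missing_pool(pool_counts, all_osds=[0, 1, 2])
```

Grouping the OSD ids by host would give the same check at the host level.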





Attached are results from applying a 'balanced' crushmap on a test cluster
with a single pool, and my (dirty) Python script.





There is a little bug in your script preventing it from stopping after
the specified number of passes:
--- offline_reweight_tool.sh.orig    2017-01-10 09:17:54.735290710 +0000
+++ offline_reweight_tool.sh    2017-01-04 08:47:47.065105637 +0000
@@ -107,7 +107,7 @@
            variable=
        ;;
                passes)
-                        if [ $passes -gt 0 ]
+                        if [ $i -gt 0 ]
                        then
                                passes=$i
                        else





Regards,

PS





On 11/23/2016 06:38 PM, David Turner wrote:

> I decided to start a new thread to discuss this tool.  I added in some
> comments and removed a few things specific to our environment (like
> needing to run ceph as sudo because we have our ceph config files
> readable only by root).
>
> To answer Tomasz's question.  We have our down_out interval set really
> high so that when an OSD goes down, we go in and generate a new map
> before we remove the drive so it only backfills once.  With that it
> moves data much less because you don't backfill when it goes out and
> then again to balance the cluster.  Generally this backfilling is
> about the same as the backfill that happens automatically when the osd
> goes out.

>

>

> In its current incarnation...
>
> 1) This script is capable of balancing a cluster with 1 pool that has
> the vast majority of all of the data (hardcoded to rbd, but easily
> changeable).
> 2) It is assumed that all of your drives are larger than 1000GB for
> how it calculates how many pgs you should have per TB.
> 3) It works by changing weights on the crush map until all osds are
> within 2 pgs of each other for the primary data pool.
> 4) The --offset option is pivotal to balancing the map.  Test this
> setting going up and down until you have the best set of osds being
> weighted up and down.  Some of our clusters like a 4, others like 0,
> most like 2.  I think it has to do with how many pgs you have in other
> pools, but this variable allows for variations between clusters.
> 5) Running this script will make zero modifications to your cluster.
> Its purpose is to generate a crush map for you to test with the
> crushtool and by uploading to your cluster with the necessary flags set.
> 6) This script assumes that your pg_num is a power of 2.  If your
> pg_num is not a power of 2, then some of your pgs are twice as big as
> other pgs and balancing by how many pgs an osd has will result in an
> imbalanced cluster.
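
Points 3 and 6 in the list above boil down to two simple checks, sketched here in Python. This is an illustration only, not the script's actual logic: the function names are hypothetical, and the balance check assumes equally sized OSDs, so it compares raw PG counts rather than PGs per TB.

```python
def is_power_of_two(pg_num):
    """True if pg_num is a positive power of 2 (so all PGs are the same size)."""
    return pg_num > 0 and (pg_num & (pg_num - 1)) == 0


def is_balanced(pgs_per_osd, max_spread=2):
    """True if all OSDs are within max_spread PGs of each other.

    pgs_per_osd maps OSD id -> PG count for the primary data pool.
    Assumes equally sized OSDs.
    """
    counts = list(pgs_per_osd.values())
    if not counts:
        return True  # nothing to compare
    return max(counts) - min(counts) <= max_spread


# Hypothetical PG counts for three OSDs.
balanced_example = is_balanced({0: 100, 1: 101, 2: 102})
unbalanced_example = is_balanced({0: 100, 1: 101, 2: 103})
```

A reweighting loop would keep adjusting weights and re-running the mapping test until a check like `is_balanced` passes or the pass limit is reached.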

>

>

> The idea/theory for making this work for a cluster with multiple pools
> sharing the data is to calculate how much a pg for each pool is worth
> (based on the % of data in each pool) and sum the weighted values of
> each pg that an osd has to know if it needs to gain or lose pgs.
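
The idea quoted above can be sketched in a few lines of Python. This is one possible reading of it, not the tool's implementation: each PG of a pool is valued at the pool's share of the total data divided by the pool's pg_num, and an OSD's score is the sum of the values of the PGs it holds. The function names, input shapes, and sample numbers are all assumptions made for illustration.

```python
def pg_values(pool_bytes, pool_pg_num):
    """Value of one PG per pool: (pool share of total data) / pg_num."""
    total = sum(pool_bytes.values())
    if total == 0:
        raise ValueError("no data in any pool")
    return {
        pool: (used / total) / pool_pg_num[pool]
        for pool, used in pool_bytes.items()
    }


def osd_scores(pg_to_osds, values):
    """Sum the weighted value of every PG replica each OSD holds."""
    scores = {}
    for pgid, osds in pg_to_osds.items():
        pool = pgid.split(".", 1)[0]  # pool id is the part before the dot
        for osd in osds:
            scores[osd] = scores.get(osd, 0.0) + values[pool]
    return scores


# Hypothetical cluster: pool 1 holds 90% of the data, pool 2 holds 10%.
values = pg_values({"1": 900, "2": 100}, {"1": 4, "2": 4})
scores = osd_scores({"1.0": [0, 1], "1.1": [1, 2], "2.0": [0, 2]}, values)
```

In this sample, OSD 1 scores higher than OSDs 0 and 2 because it holds two PGs of the heavy pool, so it would be the one to lose PGs.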

>

> I have generated maps using a modified version of this tool for a
> cluster with a data and cache pool using separate disks in separate
> roots which worked quite well.  The modifications were to balance each
> pool one at a time with hardcoded output supplied from the owner of
> the cluster for the replica sizes, pool number, osd tree, and osd df.

>

> Let me know what you think.  I know that this has worked extremely
> well for my co-workers and myself, but we have very limited variety in
> our setups.











-- 
PS






_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com