Hi,

I've been testing your script for a while. For a single pool it works nicely, but most of our clusters use multiple pools (vms, volumes, images, etc.). I wrote a simpler version in Python (it only looks at PG counts, not disk space) that works across all pools (no pool ID passed to osdmaptool), but the results are bad. Sometimes the cluster is unable to rebalance after applying such a map, and the PG distribution of the individual pools is uneven (some hosts may end up with no PGs at all from some pools). Have you created/tested a 'weighted' version of this tool?

Attached are the results of applying the 'balanced' crushmap on a test cluster with a single pool, together with my (dirty) Python script.

There is also a small bug in your script that prevents it from stopping after the specified number of passes:

--- offline_reweight_tool.sh.orig       2017-01-10 09:17:54.735290710 +0000
+++ offline_reweight_tool.sh    2017-01-04 08:47:47.065105637 +0000
@@ -107,7 +107,7 @@
                 variable=
                 ;;
         passes)
-                if [ $passes -gt 0 ]
+                if [ $i -gt 0 ]
                 then
                         passes=$i
                 else

Regards,

PS

On 11/23/2016 06:38 PM, David Turner wrote:
> I decided to start a new thread to discuss this tool. I added in some
> comments and removed a few things specific to our environment (like
> needing to run ceph as sudo because we have our ceph config files
> readable only by root).
>
> To answer Tomasz's question. We have our down_out interval set really
> high so that when an OSD goes down, we go in and generate a new map
> before we remove the drive so it only backfills once. With that it
> moves data much less because you don't backfill when it goes out and
> then again to balance the cluster. Generally this backfilling is
> about the same as the backfill that happens automatically when the osd
> goes out.
>
> In its current incarnation...
>
> 1) This script is capable of balancing a cluster with 1 pool that has
> a vast majority of all of the data (hardcoded to rbd, but easily
> changeable).
> 2) It is assumed that all of your drives are larger than 1000GB for
> how it calculates how many pgs you should have per TB.
> 3) It works by changing weights on the crush map until all osds are
> within 2 pgs of each other for the primary data pool.
> 4) The --offset option is pivotal to balancing the map. Test this
> setting going up and down until you have the best set of osds being
> weighted up and down. Some of our clusters like a 4, others like 0,
> most like 2. I think it has to do with how many pgs you have in other
> pools, but this variable allows for variations between clusters.
> 5) Running this script will make zero modifications to your cluster.
> Its purpose is to generate a crush map for you to test with the
> crushtool and by uploading to your cluster with the necessary flags set.
> 6) This script assumes that your pg_num is a power of 2. If your
> pg_num is not a power of 2, then some of your pgs are twice as big as
> other pgs and balancing by how many pgs an osd has will result in an
> imbalanced cluster.
>
> The idea/theory for making this work for a cluster with multiple pools
> sharing the data is to calculate how much a pg for each pool is worth
> (based on the % of data in each pool) and sum the weighted values of
> each pg that an osd has to know if it needs to gain or lose pgs.
>
> I have generated maps using a modified version of this tool for a
> cluster with a data and cache pool using separate disks in separate
> roots which worked quite well.
> The modifications were to balance each pool one at a time with
> hardcoded output supplied from the owner of the cluster for the
> replica sizes, pool number, osd tree, and osd df.
>
> Let me know what you think. I know that this has worked extremely
> well for my co-workers and myself, but we have very limited variety in
> our setups.
>
> David Turner | Cloud Operations Engineer |
> StorageCraft Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
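Regarding the weighted multi-pool idea quoted above, this is roughly how I
picture it (only a sketch, not tested; the function names and input dicts are
mine, not from either tool -- in practice pool sizes would come from 'ceph df',
pg_num from 'ceph osd pool get', and the pg->osd mapping from
'osdmaptool --test-map-pgs-dump'):

    from collections import defaultdict

    def pg_worth_per_pool(pool_bytes, pool_pg_num):
        # both dicts are keyed by numeric pool id;
        # one PG in a pool is "worth" the pool's data divided by its pg_num
        return {p: pool_bytes[p] / float(pool_pg_num[p]) for p in pool_bytes}

    def osd_scores(pg_to_osds, pg_worth):
        # pg_to_osds: {'<pool_id>.<pg_seed>': [osd, osd, ...]}
        # score of an OSD = sum of the worth of every PG mapped to it
        scores = defaultdict(float)
        for pgid, osds in pg_to_osds.items():
            pool_id = int(pgid.split('.')[0])
            for osd in osds:
                scores[osd] += pg_worth[pool_id]
        return scores

An OSD whose score is above the cluster average would get its crush weight
nudged down and one below the average nudged up, the same way the plain
pg_count is compared against the average in my script further down.

For orientation: the two ceph osd df listings below show the test cluster
first after applying the generated 'balanced' crushmap and then with the
original uniform crush weights; the script itself follows after them.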
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 4.65797  1.00000 5554G 1189G 4364G 21.42 0.99 113
10 5.17633  1.00000 5554G 1203G 4351G 21.66 1.00 107
13 4.37465  1.00000 5554G 1232G 4322G 22.18 1.02 110
15 4.32748  1.00000 5554G 1188G 4365G 21.40 0.98 100
16 3.75424  1.00000 5554G 1202G 4351G 21.65 1.00 105
17 4.02464  1.00000 5554G 1229G 4324G 22.14 1.02 105
 1 3.79211  1.00000 5554G 1204G 4349G 21.69 1.00 105
 3 4.89978  1.00000 5554G 1201G 4352G 21.64 1.00 110
 9 4.97583  1.00000 5554G 1216G 4337G 21.91 1.01 107
12 3.86906  1.00000 5554G 1216G 4337G 21.91 1.01 104
11 4.63477  1.00000 5554G 1215G 4338G 21.89 1.01 102
14 4.88583  1.00000 5554G 1190G 4363G 21.43 0.99 112
 2 4.28009  1.00000 5554G 1205G 4348G 21.71 1.00 102
 5 4.50400  1.00000 5554G 1204G 4349G 21.69 1.00 116
 4 4.69147  1.00000 5554G 1201G 4352G 21.64 1.00 112
 6 4.08403  1.00000 5554G 1232G 4321G 22.20 1.02 101
 7 3.34958  1.00000 5554G 1198G 4355G 21.59 0.99  90
 8 4.72093  1.00000 5554G 1201G 4352G 21.64 1.00 119
             TOTAL 99977G 21738G 78238G 21.74
MIN/MAX VAR: 0.98/1.02  STDDEV: 0.24
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 5.42000  1.00000 5554G 1148G 4405G 20.68 0.95 109
10 5.42000  1.00000 5554G  991G 4562G 17.86 0.82  88
13 5.42000  1.00000 5554G 1175G 4378G 21.16 0.97 106
15 5.42000  1.00000 5554G 1201G 4352G 21.64 1.00 103
16 5.42000  1.00000 5554G 1385G 4168G 24.95 1.15 121
17 5.42000  1.00000 5554G 1342G 4212G 24.17 1.11 113
 1 5.42000  1.00000 5554G 1388G 4166G 24.99 1.15 124
 3 5.42000  1.00000 5554G 1173G 4381G 21.12 0.97 108
 9 5.42000  1.00000 5554G 1118G 4436G 20.13 0.93  99
12 5.42000  1.00000 5554G 1230G 4323G 22.15 1.02 105
11 5.42000  1.00000 5554G 1131G 4422G 20.37 0.94  97
14 5.42000  1.00000 5554G 1204G 4349G 21.68 1.00 107
 2 5.42000  1.00000 5554G 1177G 4376G 21.20 0.98 100
 5 5.42000  1.00000 5554G 1077G 4476G 19.40 0.89 105
 4 5.42000  1.00000 5554G 1101G 4452G 19.84 0.91 102
 6 5.42000  1.00000 5554G 1374G 4179G 24.75 1.14 113
 7 5.42000  1.00000 5554G 1453G 4100G 26.17 1.20 115
 8 5.42000  1.00000 5554G 1061G 4493G 19.10 0.88 105
             TOTAL 99977G 21738G 78238G 21.74
MIN/MAX VAR: 0.82/1.20  STDDEV: 2.28
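As far as I can tell, the VAR column is just each OSD's utilisation divided by
the cluster average, and the STDDEV line is the standard deviation of the %USE
column, so the summary lines can be recomputed from a listing with a few lines
of python (throwaway helper, not part of either tool; the numbers are the %USE
values from the second listing):

    use = [20.68, 17.86, 21.16, 21.64, 24.95, 24.17, 24.99, 21.12, 20.13,
           22.15, 20.37, 21.68, 21.20, 19.40, 19.84, 24.75, 26.17, 19.10]

    avg = sum(use) / len(use)
    stddev = (sum((u - avg) ** 2 for u in use) / len(use)) ** 0.5

    # prints 'MIN/MAX VAR: 0.82/1.20  STDDEV: 2.28', matching the listing
    print 'MIN/MAX VAR: {:.2f}/{:.2f}  STDDEV: {:.2f}'.format(
        min(use) / avg, max(use) / avg, stddev)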
#!/usr/bin/python
# PYTHON_ARGCOMPLETE_OK

import os
import re
import sys
import time
import shutil
import pprint
import signal
import argparse
import subprocess
import argcomplete


def test_new_weights(osdmap_fh, crushmap_fh, pool=None):
    # Dry-run the current crushmap against the osdmap and parse the per-OSD
    # pg counts plus the avg/stddev summary printed by osdmaptool.
    weights = {}
    pg_stats = {
        'min_pg': 100000000,
        'max_pg': 0,
    }

    cmd = 'osdmaptool /dev/fd/{} --import-crush /dev/fd/{} --test-map-pgs --mark-up-in --clear-temp 2>/dev/null'.format(
        osdmap_fh.fileno(), crushmap_fh.fileno())
    if pool is not None:
        cmd += ' --pool {}'.format(pool)
    osdmaptool = subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)

    for line in osdmaptool.split('\n'):
        m = re.match(r'^osd\.(?P<id>\d+)\s+(?P<pg_count>\d+)\s+(?P<first>\d+)\s+(?P<primary>\d+)\s+(?P<crush_weight>\d+(\.\d+)?)\s+(?P<weight>\d+(\.\d+)?)$', line)
        k = re.match(r'^\s+avg\s+(?P<avg>\d+)\s+stddev\s+(?P<stddev>\d+(\.\d+)?)\s+', line)
        if m:
            weights[m.group('id')] = {
                'pg_count': float(m.group('pg_count')),
                'crush_weight': float(m.group('crush_weight')),
            }
            if float(m.group('pg_count')) > pg_stats['max_pg']:
                pg_stats['max_pg'] = float(m.group('pg_count'))
            if float(m.group('pg_count')) < pg_stats['min_pg']:
                pg_stats['min_pg'] = float(m.group('pg_count'))
        elif k:
            pg_stats['avg'] = float(k.group('avg'))
            pg_stats['stddev'] = float(k.group('stddev'))

    return (weights, pg_stats)


def update_crush_weights(crushmap_fh, old_weights, avg_pg_count, change_step):
    # Nudge every OSD below the average pg count up and every OSD above it
    # down by change_step/10000 of its current crush weight.
    stats = {
        'up': 0,
        'down': 0,
    }
    for osd_id in old_weights:
        osd = old_weights[osd_id]
        if osd['pg_count'] < avg_pg_count:
            new_osd_weight = osd['crush_weight'] * (1.0 + (change_step / 10000.0))
            cmd = 'crushtool -i /dev/fd/{} -o /dev/fd/{} --reweight-item osd.{} {}'.format(
                crushmap_fh.fileno(), crushmap_fh.fileno(), osd_id, new_osd_weight)
            subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
            stats['up'] += 1
        elif osd['pg_count'] > avg_pg_count:
            new_osd_weight = osd['crush_weight'] * (1.0 - (change_step / 10000.0))
            cmd = 'crushtool -i /dev/fd/{} -o /dev/fd/{} --reweight-item osd.{} {}'.format(
                crushmap_fh.fileno(), crushmap_fh.fileno(), osd_id, new_osd_weight)
            subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
            stats['down'] += 1
    return stats


finished = False


def exit_handler(signum, frame):
    # Ctrl-C: finish the current round, then drop out of the loop cleanly.
    global finished
    finished = True


if __name__ == '__main__':

    signal.signal(signal.SIGINT, exit_handler)

    parser = argparse.ArgumentParser()
    parser.add_argument('osdmap', help='path to osdmap (binary)', default=None, type=str)
    parser.add_argument('crushmap', help='path to crushmap (binary)', default=None, type=str)
    parser.add_argument('--target-stddev', help='target stddev', default=1.0, type=float)
    parser.add_argument('--initial-change-step', help='initial change step, in 1/100 of a percent of the current crush weight', default=500, type=int)
    parser.add_argument('--max-rounds', help='max number of rounds', default=1000, type=int)
    parser.add_argument('--pg-min-max-diff', help='max acceptable difference between min_pg and max_pg', default=0, type=int)
    parser.add_argument('--pool-id', help='pool id used to calculate distribution', default=None, type=int)
    argcomplete.autocomplete(parser)
    args = parser.parse_args()

    change_step = args.initial_change_step
    target_stddev = args.target_stddev
    osdmap = args.osdmap
    original_crushmap = args.crushmap
    pool = args.pool_id
    last_stddev = 999999
    round_no = 0

    # prepare crushmap copy to operate on
    crushmap = 'cm_reweight_{}'.format(time.time())
    shutil.copy(original_crushmap, crushmap)

    with open(osdmap, 'r') as osdmap_fh, open(crushmap, 'r+') as crushmap_fh:
        print 'working on {}'.format(crushmap)
        while not finished and round_no < args.max_rounds:
            round_no += 1
            (weights, pg_stats) = test_new_weights(osdmap_fh, crushmap_fh, pool=pool)

            # if the previous round made the distribution worse, shrink the step
            if last_stddev < pg_stats['stddev']:
                if change_step > 100:
                    change_step -= 10
                    print '\r\nlowering change_step to {}'.format(change_step)
                elif change_step > 1:
                    change_step -= 1
                    print '\r\nlowering change_step to {}'.format(change_step)
            last_stddev = pg_stats['stddev']

            if pg_stats['stddev'] <= target_stddev or pg_stats['max_pg'] - pg_stats['min_pg'] <= args.pg_min_max_diff:
                finished = True
                break

            update_stats = update_crush_weights(crushmap_fh, weights, pg_stats['avg'], change_step)
            sys.stdout.write('\rround: {:5.0f}, stddev: {:8.4f}, up: {:4.0f}, down: {:4.0f}, min_pg: {:4.0f}, max_pg: {:4.0f}'.format(
                round_no, pg_stats['stddev'], update_stats['up'], update_stats['down'], pg_stats['min_pg'], pg_stats['max_pg']))
            sys.stdout.flush()

    print ''
    pprint.pprint(weights)
    pprint.pprint(pg_stats)
    print 'to apply run:\r\n\tceph osd setcrushmap -i {}'.format(crushmap)
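For completeness, this is roughly how I drive it (the file names, script name
and pool id are only examples):

    ceph osd getmap -o osdmap
    ceph osd getcrushmap -o crushmap
    ./reweight_pgs.py osdmap crushmap --pool-id 0 --target-stddev 1.0

The script only writes the cm_reweight_<timestamp> file; I look at it with
'crushtool -d' and run another 'osdmaptool --import-crush ... --test-map-pgs'
pass against it before doing the 'ceph osd setcrushmap -i ...' it suggests,
with the usual flags (noout etc.) set while the map is injected.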