Hi,

I've been testing your script for a while. For a single pool it works nicely, but most of our clusters use multiple pools (vms, volumes, images, etc.). I wrote a simpler version in Python (it only looks at PG counts, not disk space) that works across all pools (no pool ID passed to osdmaptool), but the results are bad. Sometimes the cluster is unable to rebalance after applying such a map, and the PG distribution of the individual pools is uneven (some hosts may end up with no PGs at all from some pools). Have you created/tested a 'weighted' version of this tool?

Attached are the results of applying the 'balanced' crushmap on a test cluster with a single pool, together with my (dirty) Python script.

There is also a small bug in your script that prevents it from stopping after the specified number of passes:

--- offline_reweight_tool.sh.orig       2017-01-10 09:17:54.735290710 +0000
+++ offline_reweight_tool.sh    2017-01-04 08:47:47.065105637 +0000
@@ -107,7 +107,7 @@
                 variable=
                 ;;
         passes)
-                if [ $passes -gt 0 ]
+                if [ $i -gt 0 ]
                 then
                         passes=$i
                 else

Regards,

PS

On 11/23/2016 06:38 PM, David Turner wrote:
> I decided to start a new thread to discuss this tool. I added in some
> comments and removed a few things specific to our environment (like
> needing to run ceph as sudo because we have our ceph config files
> readable only by root).
>
> To answer Tomasz's question. We have our down_out interval set really
> high so that when an OSD goes down, we go in and generate a new map
> before we remove the drive so it only backfills once. With that it
> moves data much less because you don't backfill when it goes out and
> then again to balance the cluster. Generally this backfilling is
> about the same as the backfill that happens automatically when the osd
> goes out.
>
> In its current incarnation...
>
> 1) This script is capable of balancing a cluster with 1 pool that has
> a vast majority of all of the data (hardcoded to rbd, but easily
> changeable).
> 2) It is assumed that all of your drives are larger than 1000GB for
> how it calculates how many pgs you should have per TB.
> 3) It works by changing weights on the crush map until all osds are
> within 2 pgs of each other for the primary data pool.
> 4) The --offset option is pivotal to balancing the map. Test this
> setting going up and down until you have the best set of osds being
> weighted up and down. Some of our clusters like a 4, others like 0,
> most like 2. I think it has to do with how many pgs you have in other
> pools, but this variable allows for variations between clusters.
> 5) Running this script will make zero modifications to your cluster.
> Its purpose is to generate a crush map for you to test with the
> crushtool and by uploading to your cluster with the necessary flags set.
> 6) This script assumes that your pg_num is a power of 2. If your
> pg_num is not a power of 2, then some of your pgs are twice as big as
> other pgs and balancing by how many pgs an osd has will result in an
> imbalanced cluster.
>
> The idea/theory for making this work for a cluster with multiple pools
> sharing the data is to calculate how much a pg for each pool is worth
> (based on the % of data in each pool) and sum the weighted values of
> each pg that an osd has to know if it needs to gain or lose pgs.
>
> I have generated maps using a modified version of this tool for a
> cluster with a data and cache pool using separate disks in separate
> roots which worked quite well.
> The modifications were to balance each pool one at a time with
> hardcoded output supplied from the owner of the cluster for the
> replica sizes, pool number, osd tree, and osd df.
>
> Let me know what you think. I know that this has worked extremely
> well for my co-workers and myself, but we have very limited variety in
> our setups.
>
> David Turner | Cloud Operations Engineer |
> StorageCraft Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
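Regarding the weighted multi-pool idea quoted above, this is roughly how I
picture it (only a sketch, not tested; the function names and input dicts are
mine, not from either tool -- in practice pool sizes would come from 'ceph df',
pg_num from 'ceph osd pool get', and the pg->osd mapping from
'osdmaptool --test-map-pgs-dump'):

    from collections import defaultdict

    def pg_worth_per_pool(pool_bytes, pool_pg_num):
        # both dicts are keyed by numeric pool id;
        # one PG in a pool is "worth" the pool's data divided by its pg_num
        return {p: pool_bytes[p] / float(pool_pg_num[p]) for p in pool_bytes}

    def osd_scores(pg_to_osds, pg_worth):
        # pg_to_osds: {'<pool_id>.<pg_seed>': [osd, osd, ...]}
        # score of an OSD = sum of the worth of every PG mapped to it
        scores = defaultdict(float)
        for pgid, osds in pg_to_osds.items():
            pool_id = int(pgid.split('.')[0])
            for osd in osds:
                scores[osd] += pg_worth[pool_id]
        return scores

An OSD whose score is above the cluster average would get its crush weight
nudged down and one below the average nudged up, the same way the plain
pg_count is compared against the average in my script further down.

For orientation: the two ceph osd df listings below show the test cluster
first after applying the generated 'balanced' crushmap and then with the
original uniform crush weights; the script itself follows after them.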
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 4.65797  1.00000 5554G 1189G 4364G 21.42 0.99 113
10 5.17633  1.00000 5554G 1203G 4351G 21.66 1.00 107
13 4.37465  1.00000 5554G 1232G 4322G 22.18 1.02 110
15 4.32748  1.00000 5554G 1188G 4365G 21.40 0.98 100
16 3.75424  1.00000 5554G 1202G 4351G 21.65 1.00 105
17 4.02464  1.00000 5554G 1229G 4324G 22.14 1.02 105
 1 3.79211  1.00000 5554G 1204G 4349G 21.69 1.00 105
 3 4.89978  1.00000 5554G 1201G 4352G 21.64 1.00 110
 9 4.97583  1.00000 5554G 1216G 4337G 21.91 1.01 107
12 3.86906  1.00000 5554G 1216G 4337G 21.91 1.01 104
11 4.63477  1.00000 5554G 1215G 4338G 21.89 1.01 102
14 4.88583  1.00000 5554G 1190G 4363G 21.43 0.99 112
 2 4.28009  1.00000 5554G 1205G 4348G 21.71 1.00 102
 5 4.50400  1.00000 5554G 1204G 4349G 21.69 1.00 116
 4 4.69147  1.00000 5554G 1201G 4352G 21.64 1.00 112
 6 4.08403  1.00000 5554G 1232G 4321G 22.20 1.02 101
 7 3.34958  1.00000 5554G 1198G 4355G 21.59 0.99  90
 8 4.72093  1.00000 5554G 1201G 4352G 21.64 1.00 119
             TOTAL 99977G 21738G 78238G 21.74
MIN/MAX VAR: 0.98/1.02  STDDEV: 0.24
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 5.42000  1.00000 5554G 1148G 4405G 20.68 0.95 109
10 5.42000  1.00000 5554G  991G 4562G 17.86 0.82  88
13 5.42000  1.00000 5554G 1175G 4378G 21.16 0.97 106
15 5.42000  1.00000 5554G 1201G 4352G 21.64 1.00 103
16 5.42000  1.00000 5554G 1385G 4168G 24.95 1.15 121
17 5.42000  1.00000 5554G 1342G 4212G 24.17 1.11 113
 1 5.42000  1.00000 5554G 1388G 4166G 24.99 1.15 124
 3 5.42000  1.00000 5554G 1173G 4381G 21.12 0.97 108
 9 5.42000  1.00000 5554G 1118G 4436G 20.13 0.93  99
12 5.42000  1.00000 5554G 1230G 4323G 22.15 1.02 105
11 5.42000  1.00000 5554G 1131G 4422G 20.37 0.94  97
14 5.42000  1.00000 5554G 1204G 4349G 21.68 1.00 107
 2 5.42000  1.00000 5554G 1177G 4376G 21.20 0.98 100
 5 5.42000  1.00000 5554G 1077G 4476G 19.40 0.89 105
 4 5.42000  1.00000 5554G 1101G 4452G 19.84 0.91 102
 6 5.42000  1.00000 5554G 1374G 4179G 24.75 1.14 113
 7 5.42000  1.00000 5554G 1453G 4100G 26.17 1.20 115
 8 5.42000  1.00000 5554G 1061G 4493G 19.10 0.88 105
             TOTAL 99977G 21738G 78238G 21.74
MIN/MAX VAR: 0.82/1.20  STDDEV: 2.28
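As far as I can tell, the VAR column is just each OSD's utilisation divided by
the cluster average, and the STDDEV line is the standard deviation of the %USE
column, so the summary lines can be recomputed from a listing with a few lines
of python (throwaway helper, not part of either tool; the numbers are the %USE
values from the second listing):

    use = [20.68, 17.86, 21.16, 21.64, 24.95, 24.17, 24.99, 21.12, 20.13,
           22.15, 20.37, 21.68, 21.20, 19.40, 19.84, 24.75, 26.17, 19.10]

    avg = sum(use) / len(use)
    stddev = (sum((u - avg) ** 2 for u in use) / len(use)) ** 0.5

    # prints 'MIN/MAX VAR: 0.82/1.20  STDDEV: 2.28', matching the listing
    print 'MIN/MAX VAR: {:.2f}/{:.2f}  STDDEV: {:.2f}'.format(
        min(use) / avg, max(use) / avg, stddev)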
#!/usr/bin/python
# PYTHON_ARGCOMPLETE_OK

import os
import re
import sys
import time
import shutil
import pprint
import signal
import argparse
import subprocess
import argcomplete


def test_new_weights(osdmap_fh, crushmap_fh, pool=None):
    # Dry-run the current crushmap against the osdmap and parse the per-OSD
    # pg counts plus the avg/stddev summary printed by osdmaptool.
    weights = {}
    pg_stats = {
        'min_pg': 100000000,
        'max_pg': 0,
    }

    cmd = 'osdmaptool /dev/fd/{} --import-crush /dev/fd/{} --test-map-pgs --mark-up-in --clear-temp 2>/dev/null'.format(
        osdmap_fh.fileno(), crushmap_fh.fileno())
    if pool is not None:
        cmd += ' --pool {}'.format(pool)
    osdmaptool = subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)

    for line in osdmaptool.split('\n'):
        m = re.match(r'^osd\.(?P<id>\d+)\s+(?P<pg_count>\d+)\s+(?P<first>\d+)\s+(?P<primary>\d+)\s+(?P<crush_weight>\d+(\.\d+)?)\s+(?P<weight>\d+(\.\d+)?)$', line)
        k = re.match(r'^\s+avg\s+(?P<avg>\d+)\s+stddev\s+(?P<stddev>\d+(\.\d+)?)\s+', line)
        if m:
            weights[m.group('id')] = {
                'pg_count': float(m.group('pg_count')),
                'crush_weight': float(m.group('crush_weight')),
            }
            if float(m.group('pg_count')) > pg_stats['max_pg']:
                pg_stats['max_pg'] = float(m.group('pg_count'))
            if float(m.group('pg_count')) < pg_stats['min_pg']:
                pg_stats['min_pg'] = float(m.group('pg_count'))
        elif k:
            pg_stats['avg'] = float(k.group('avg'))
            pg_stats['stddev'] = float(k.group('stddev'))

    return (weights, pg_stats)


def update_crush_weights(crushmap_fh, old_weights, avg_pg_count, change_step):
    # Nudge every OSD below the average pg count up and every OSD above it
    # down by change_step/10000 of its current crush weight.
    stats = {
        'up': 0,
        'down': 0,
    }
    for osd_id in old_weights:
        osd = old_weights[osd_id]
        if osd['pg_count'] < avg_pg_count:
            new_osd_weight = osd['crush_weight'] * (1.0 + (change_step / 10000.0))
            cmd = 'crushtool -i /dev/fd/{} -o /dev/fd/{} --reweight-item osd.{} {}'.format(
                crushmap_fh.fileno(), crushmap_fh.fileno(), osd_id, new_osd_weight)
            subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
            stats['up'] += 1
        elif osd['pg_count'] > avg_pg_count:
            new_osd_weight = osd['crush_weight'] * (1.0 - (change_step / 10000.0))
            cmd = 'crushtool -i /dev/fd/{} -o /dev/fd/{} --reweight-item osd.{} {}'.format(
                crushmap_fh.fileno(), crushmap_fh.fileno(), osd_id, new_osd_weight)
            subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
            stats['down'] += 1
    return stats


finished = False


def exit_handler(signum, frame):
    # Ctrl-C: finish the current round, then drop out of the loop cleanly.
    global finished
    finished = True


if __name__ == '__main__':

    signal.signal(signal.SIGINT, exit_handler)

    parser = argparse.ArgumentParser()
    parser.add_argument('osdmap', help='path to osdmap (binary)', default=None, type=str)
    parser.add_argument('crushmap', help='path to crushmap (binary)', default=None, type=str)
    parser.add_argument('--target-stddev', help='target stddev', default=1.0, type=float)
    parser.add_argument('--initial-change-step', help='initial change step, in 1/100 of a percent of the current crush weight', default=500, type=int)
    parser.add_argument('--max-rounds', help='max number of rounds', default=1000, type=int)
    parser.add_argument('--pg-min-max-diff', help='max acceptable difference between min_pg and max_pg', default=0, type=int)
    parser.add_argument('--pool-id', help='pool id used to calculate distribution', default=None, type=int)
    argcomplete.autocomplete(parser)
    args = parser.parse_args()

    change_step = args.initial_change_step
    target_stddev = args.target_stddev
    osdmap = args.osdmap
    original_crushmap = args.crushmap
    pool = args.pool_id
    last_stddev = 999999
    round_no = 0

    # prepare crushmap copy to operate on
    crushmap = 'cm_reweight_{}'.format(time.time())
    shutil.copy(original_crushmap, crushmap)

    with open(osdmap, 'r') as osdmap_fh, open(crushmap, 'r+') as crushmap_fh:
        print 'working on {}'.format(crushmap)
        while not finished and round_no < args.max_rounds:
            round_no += 1
            (weights, pg_stats) = test_new_weights(osdmap_fh, crushmap_fh, pool=pool)

            # if the previous round made the distribution worse, shrink the step
            if last_stddev < pg_stats['stddev']:
                if change_step > 100:
                    change_step -= 10
                    print '\r\nlowering change_step to {}'.format(change_step)
                elif change_step > 1:
                    change_step -= 1
                    print '\r\nlowering change_step to {}'.format(change_step)
            last_stddev = pg_stats['stddev']

            if pg_stats['stddev'] <= target_stddev or pg_stats['max_pg'] - pg_stats['min_pg'] <= args.pg_min_max_diff:
                finished = True
                break

            update_stats = update_crush_weights(crushmap_fh, weights, pg_stats['avg'], change_step)
            sys.stdout.write('\rround: {:5.0f}, stddev: {:8.4f}, up: {:4.0f}, down: {:4.0f}, min_pg: {:4.0f}, max_pg: {:4.0f}'.format(
                round_no, pg_stats['stddev'], update_stats['up'], update_stats['down'], pg_stats['min_pg'], pg_stats['max_pg']))
            sys.stdout.flush()

    print ''
    pprint.pprint(weights)
    pprint.pprint(pg_stats)
    print 'to apply run:\r\n\tceph osd setcrushmap -i {}'.format(crushmap)
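For completeness, this is roughly how I drive it (the file names, script name
and pool id are only examples):

    ceph osd getmap -o osdmap
    ceph osd getcrushmap -o crushmap
    ./reweight_pgs.py osdmap crushmap --pool-id 0 --target-stddev 1.0

The script only writes the cm_reweight_<timestamp> file; I look at it with
'crushtool -d' and run another 'osdmaptool --import-crush ... --test-map-pgs'
pass against it before doing the 'ceph osd setcrushmap -i ...' it suggests,
with the usual flags (noout etc.) set while the map is injected.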