Re: Crush Offline Reweighting tool




For sure, I'll try it on Monday. In the meantime I've updated my script with some initial support for multiple pools, based on the previous one and your ideas. I've tested it on some clusters and it gives pretty good results (disk usage within a 3% range). But I noticed that on clusters where rack weights differ a lot, the distribution of primary OSDs is uneven. Did you notice similar behavior? (You can check it in the osdmaptool output.) For such clusters I also had to increase the choose_total_tries tunable to let the cluster finish rebalancing (as in your example, PGs were stuck in the active+remapped state). This mostly matters when we grow a cluster of, for example, three racks by adding some new OSDs in a fourth (or n-th) rack, when we don't want to/can't fill the whole rack at once (so it has a much smaller weight than the other three). Btw, we use a failure domain of rack with replica size 3.
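One quick way to check the primary distribution is to count how often each OSD appears as primary in the `osdmaptool --test-map-pgs-dump` output. A minimal sketch (the sample PG lines below are made up; the line format is the same one the attached script parses):

```python
import re
from collections import Counter

def count_primaries(dump_text):
    """Count how often each OSD is primary in osdmaptool --test-map-pgs-dump output."""
    primaries = Counter()
    for line in dump_text.splitlines():
        # PG lines look like: "1.2f [4,11,7] 4" (pgid, up set, primary OSD)
        m = re.match(r'^\s*[0-9a-fA-F]+\.[0-9a-fA-F]+\s+\[(?P<upset>[\d,]+)\]\s+(?P<primary>\d+)\s*$', line)
        if m:
            primaries[int(m.group('primary'))] += 1
    return primaries

# hypothetical dump fragment
sample = """1.0 [4,11,7] 4
1.1 [11,4,2] 11
1.2 [4,2,7] 4"""
print(dict(count_primaries(sample)))  # osd.4 is primary twice, osd.11 once
```

On a balanced cluster the counts should be roughly even; on ours, the OSDs in the heavy racks come out well ahead.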

Have a nice weekend,

On 13.01.2017 19:36, David Turner wrote:
I have a beta version that should balance a map for a cluster with multiple pools of varying amounts of data.  The logic is roughly what I came up with on Tuesday.

I don't have a good dev setup to test it, but I know that it's generating maps that look pretty good.  Would you be able to try some testing with this new version, Pawel Sadowski?

There was a thread on the ceph-users ML recently where someone had a map I generated for them stuck with 4 active+remapped PGs.  In their case, they didn't understand Ceph at all and kept adding OSDs without regard for balanced placement (3 nodes with 4 TB, 1 node with 32 TB, and another with 16 TB, with replica size 3...).  In any case, setting --set-choose-total-tries to 100 let their cluster finish backfilling those last 4 PGs.  I mention it because it isn't likely to happen for anyone who understands how to design a cluster, but it could come up.

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

From: Paweł Sadowski [ceph@xxxxxxxxx]
Sent: Tuesday, January 10, 2017 2:38 AM
To: David Turner
Cc: ceph-large@xxxxxxxxxxxxxx
Subject: Re: Crush Offline Reweighting tool


I've been testing your script for a while. For a single pool it works
nicely, but most of our clusters use multiple pools (vms, volumes, images,
etc.). I wrote a simpler version (it only checks PG counts, not disk
space) in Python; it works on all pools (no need to specify a pool ID
in osdmaptool), but the results are bad. Sometimes the cluster is unable to
rebalance after applying such a map. Also, the PG distribution for each pool is
not equal (some hosts might not have a PG from some pools). Have you
created/tested a 'weighted' version of this tool?

Attached are the results from applying the 'balanced' crushmap on a test cluster
with a single pool, and my (dirty) Python script.

There is a little bug in your script that prevents it from stopping after
the specified number of passes:

---    2017-01-10 09:17:54.735290710 +0000
+++    2017-01-04 08:47:47.065105637 +0000
@@ -107,7 +107,7 @@
-                        if [ $passes -gt 0 ]
+                        if [ $i -gt 0 ]


On 11/23/2016 06:38 PM, David Turner wrote:
> I decided to start a new thread to discuss this tool.  I added in some
> comments and removed a few things specific to our environment (like
> needing to run ceph as sudo because we have our ceph config files
> readable only by root).
> To answer Tomasz's question.  We have our down_out interval set really
> high so that when an OSD goes down, we go in and generate a new map
> before we remove the drive so it only backfills once.  With that it
> moves data much less because you don't backfill when it goes out and
> then again to balance the cluster.  Generally this backfilling is
> about the same as the backfill that happens automatically when the osd
> goes out.
> In its current incarnation...
> 1) This script is capable of balancing a cluster with 1 pool that has
> a vast majority of all of the data (hardcoded to rbd, but easily
> changeable)
> 2) It is assumed that all of your drives are larger than 1000GB for
> how it calculates how many pgs you should have per TB.
> 3) It works by changing weights on the crush map until all osds are
> within 2 pgs of each other for the primary data pool.
> 4) The --offset option is pivotal to balancing the map.  Test this
> setting going up and down until you have the best set of osds being
> weighted up and down.  Some of our clusters like a 4, others like 0,
> most like 2.  I think it has to do with how many pgs you have in other
> pools, but this variable allows for variations between clusters.
> 5) Running this script will make zero modifications to your cluster.
> Its purpose is to generate a crush map for you to test with the
> crushtool and by uploading to your cluster with the necessary flags set.
> 6) This script assumes that your pg_num is a power of 2.  If your
> pg_num is not a power of 2, then some of your pgs are twice as big as
> other pgs and balancing by how many pgs an osd has will result in an
> imbalanced cluster.
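The power-of-two condition in point 6 is easy to check up front before running the tool; a small sketch:

```python
def is_power_of_two(pg_num):
    # a positive integer is a power of two iff it has exactly one set bit
    return pg_num > 0 and (pg_num & (pg_num - 1)) == 0

print(is_power_of_two(2048))  # True
print(is_power_of_two(1000))  # False
```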
> The idea/theory for making this work for a cluster with multiple pools
> sharing the data is to calculate how much a pg for each pool is worth
> (based on the % of data in each pool) and sum the weighted values of
> each pg that an osd has to know if it needs to gain or lose pgs.
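The weighting idea above can be sketched directly; the pool names, byte counts, and per-OSD PG counts here are made up for illustration:

```python
# Weight each pool's PGs by that pool's share of total data, then score
# each OSD by the weighted sum of the PGs it holds.
pool_bytes = {'vms': 600, 'volumes': 300, 'images': 100}  # hypothetical bytes_used
total = float(sum(pool_bytes.values()))
pg_worth = {pool: used / total for pool, used in pool_bytes.items()}

# hypothetical per-OSD PG counts per pool
osd_pgs = {
    0: {'vms': 40, 'volumes': 20, 'images': 5},
    1: {'vms': 35, 'volumes': 25, 'images': 10},
}
scores = {osd: sum(pg_worth[p] * n for p, n in pgs.items())
          for osd, pgs in osd_pgs.items()}
print(scores)  # an OSD scoring above average needs to lose weighted PGs
```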
> I have generated maps using a modified version of this tool for a
> cluster with a data and cache pool using separate disks in separate
> roots which worked quite well.  The modifications were to balance each
> pool one at a time with hardcoded output supplied from the owner of
> the cluster for the replica sizes, pool number, osd tree, and osd df.
> Let me know what you think.  I know that this has worked extremely
> well for my co-workers and myself, but we have very limited variety in
> our setups.
> _______________________________________________
> Ceph-large mailing list
> Ceph-large@xxxxxxxxxxxxxx



import os
import re
import sys
import time
import json
import shutil
import pprint
import signal
import argparse
import subprocess

finished = False

class NotEnoughPGMemberError(BaseException):
    pass

def log(text):
    print text

def parse_osd_df(osd_df_path):
    osd_df = {}
    with open(osd_df_path, 'r') as fh:
        df = json.load(fh)
        #osd_df = {node[u'id']: node for node in df[u'nodes']}
        osd_df = df[u'nodes']
    return osd_df

def parse_ceph_df(ceph_df_path, pools=[]):
    ceph_df = {}
    with open(ceph_df_path, 'r') as fh:
        df = json.load(fh)
        total_used = 0
        for pool in df[u'pools']:
            if len(pools) == 0 or pool[u'name'] in pools:
                total_used += pool[u'stats'][u'bytes_used']
        for pool in df[u'pools']:
            if len(pools) == 0 or pool[u'name'] in pools:
                ceph_df[pool[u'name']] = {
                    u'id': pool[u'id'],
                    u'weight': round(pool[u'stats'][u'bytes_used'] / float(total_used), 2),
                }
    return ceph_df

def set_initial_crush_weights(crushmap, osd_df, choose_total_tries=None):
    if choose_total_tries is not None:
        cmd = 'crushtool -i {} -o {} --set-choose-total-tries {}'.format(
                crushmap, crushmap, choose_total_tries)
        log('setting "choose_total_tries" tunable to {}'.format(choose_total_tries))
        subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
    for osd in osd_df:
        new_osd_weight = round(osd[u'kb'] / float(1024*1024*1024), 5)
        cmd = 'crushtool -i {} -o {} --reweight-item osd.{} {}'.format(
                crushmap, crushmap, osd[u'id'], new_osd_weight)
        subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)

def prepare_final_crush_map(crushmap, weights):
    final_crushmap = 'offline_crush_{}'.format(time.time())
    shutil.copy(crushmap, final_crushmap)
    # relies on the module-level osd_df parsed in __main__
    for osd in osd_df:
        new_osd_weight = round(weights[str(osd[u'id'])][u'crush_weight'], 5)
        cmd = 'crushtool -i {} -o {} --reweight-item osd.{} {}'.format(
                final_crushmap, final_crushmap, osd[u'id'], new_osd_weight)
        subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
    return final_crushmap

def map_pgs(osdmap, crushmap, pool_id=None):
    weights = {}
    pg_stats = {
        'min_pg': 100000000,
        'max_pg': 0,
        'mapped_min': 99,
        'mapped_max': 0,
    }
    cmd = 'osdmaptool {} --import-crush {} --test-map-pgs-dump --mark-up-in --clear-temp 2>/dev/null'.format(
            osdmap, crushmap)
    if pool_id is not None:
        cmd += ' --pool {}'.format(pool_id)
    osdmaptool = subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
    for line in osdmaptool.split('\n'):
        m = re.match(r'^osd\.(?P<id>\d+)\s+(?P<pg_count>\d+)\s+(?P<first>\d+)\s+(?P<primary>\d+)\s+(?P<crush_weight>\d+(\.\d+)?)\s+(?P<weight>\d+(\.\d+)?)$', line)
        k = re.match(r'^\s+avg\s+(?P<avg>\d+)\s+stddev\s+(?P<stddev>\d+(\.\d+)?)\s+', line)
        p = re.match(r'^\s*(?P<pg>[0-9a-fA-F]+\.[0-9a-fA-F]+)\s+\[(?P<upset>[\d,]+)\]\s+(?P<primary>\d+)\s*$', line)
        if m:
            weights[m.group('id')] = {
                'pg_count': float(m.group('pg_count')),
                'crush_weight': float(m.group('crush_weight')),
            }
            if float(m.group('pg_count')) > pg_stats['max_pg']:
                pg_stats['max_pg'] = float(m.group('pg_count'))
            if float(m.group('pg_count')) < pg_stats['min_pg']:
                pg_stats['min_pg'] = float(m.group('pg_count'))
        elif k:
            pg_stats['avg'] = float(k.group('avg'))
            pg_stats['stddev'] = float(k.group('stddev'))
        elif p:
            size = len(p.group('upset').split(','))
            if pg_stats['mapped_min'] > size:
                pg_stats['mapped_min'] = size
            if pg_stats['mapped_max'] < size:
                pg_stats['mapped_max'] = size
    if pg_stats['mapped_min'] != pg_stats['mapped_max']:
        raise NotEnoughPGMemberError('unable to find enough OSDs for some '
                'PGs, pool_id == {}, mapped(min, max) == ({}, {})'.format(
                    pool_id, pg_stats['mapped_min'], pg_stats['mapped_max']))
    return (weights, pg_stats)

def update_crush_weights(crushmap, old_weights, avg_pg_count, change_step):
    stats = {
        'up': 0,
        'down': 0,
    }
    for osd_id in old_weights:
        osd = old_weights[osd_id]
        if osd['pg_count'] < avg_pg_count:
            new_osd_weight = osd['crush_weight'] * (1.0 + change_step)
            cmd = 'crushtool -i {} -o {} --reweight-item osd.{} {}'.format(
                    crushmap, crushmap, osd_id, new_osd_weight)
            crushtool = subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
            stats['up'] += 1
        elif osd['pg_count'] > avg_pg_count:
            new_osd_weight = osd['crush_weight'] * (1.0 - change_step)
            cmd = 'crushtool -i {} -o {} --reweight-item osd.{} {}'.format(
                    crushmap, crushmap, osd_id, new_osd_weight)
            crushtool = subprocess.check_output(cmd, shell=True, preexec_fn=os.setpgrp)
            stats['down'] += 1
    return stats

def find_optimal_osd_crush_weights_for_pool(ceph_df, osd_df, osdmap, crushmap, pool_id,
        target_stddev, max_rounds, pg_min_max_diff, change_step=0.005):
    global finished
    last_stddev = 999999
    round_no = 0
    ## prepare crushmap copy to operate on 
    tmp_crushmap = 'offline_crush_{}.tmp'.format(time.time())
    shutil.copy(crushmap, tmp_crushmap)
    while not finished and round_no < max_rounds:
        round_no += 1
        if round_no == 1:
            (weights, pg_stats) = map_pgs(osdmap, tmp_crushmap, pool_id=pool_id)
        if pg_stats['stddev'] <= target_stddev or (pg_stats['max_pg'] - pg_stats['min_pg']) <= pg_min_max_diff:
            break
        ## change weights and update stats
        update_stats = update_crush_weights(tmp_crushmap, weights, pg_stats['avg'], change_step)
        (weights, pg_stats) = map_pgs(osdmap, tmp_crushmap, pool_id=pool_id)
        ## print progress info
        sys.stdout.write('\rpool: {:3.0f}, round: {:5.0f}, stddev: {:8.4f}, '
                'up: {:4.0f}, down: {:4.0f}, min_pg: {:4.0f}, max_pg: {:4.0f}'.
                format(pool_id, round_no, pg_stats['stddev'],
                    update_stats['up'], update_stats['down'],
                    pg_stats['min_pg'], pg_stats['max_pg']))
    ## clear progress line
    sys.stdout.write('\r{}\r'.format(' ' * 100))
    ## prepare final stats
    (weights, pg_stats) = map_pgs(osdmap, tmp_crushmap, pool_id=pool_id)
    return (weights, pg_stats)

def find_optimal_osd_crush_weights(ceph_df, osd_df, osdmap, crushmap, pools,
        target_stddev, max_rounds, pg_min_max_diff, change_step=0.005):
    weights = {}
    pg_stats = {}
    for pool in pools:
        (weights[pool], pg_stats[pool]) = find_optimal_osd_crush_weights_for_pool(
                ceph_df=ceph_df, osd_df=osd_df, osdmap=osdmap,
                crushmap=crushmap, pool_id=ceph_df[pool][u'id'],
                target_stddev=target_stddev, max_rounds=max_rounds,
                pg_min_max_diff=pg_min_max_diff, change_step=change_step)
    weights[u'FINAL'] = {}
    for pool in pools:
        for osd_id in weights[pool]:
            weights[u'FINAL'].setdefault(osd_id, {
                    u'crush_weight': 0, u'pg_count': 0 })
            weights[u'FINAL'][osd_id][u'crush_weight'] += \
                weights[pool][osd_id][u'crush_weight'] * ceph_df[pool][u'weight']
            weights[u'FINAL'][osd_id][u'pg_count'] += \
                weights[pool][osd_id][u'pg_count'] * ceph_df[pool][u'weight']
    return (weights, pg_stats)

def exit_handler(signum, frame):
    global finished
    finished = True

if __name__ == '__main__':
    signal.signal(signal.SIGINT, exit_handler)
    parser = argparse.ArgumentParser()
    parser.add_argument('osd_df', help='path to osd df (json)',
            default=None, type=str)
    parser.add_argument('ceph_df', help='path to ceph df (json)',
            default=None, type=str)
    parser.add_argument('osdmap', help='path to osdmap (binary)',
            default=None, type=str)
    parser.add_argument('crushmap', help='path to crushmap (binary)',
            default=None, type=str)
    parser.add_argument('pools', help='pools used to calculate distribution',
            default=None, type=str, nargs='+')
    parser.add_argument('--target-stddev', help='target stddev',
            default=1.0, type=float)
    parser.add_argument('--initial-change-step', help='change step in %%',
            default=500, type=int)
    parser.add_argument('--max-rounds', help='max number of rounds',
            default=1000, type=int)
    parser.add_argument('--pg-min-max-diff', help='max acceptable difference between min_pg and max_pg',
            default=0, type=int)
    parser.add_argument('--choose-total-tries',
            default=None, type=int,
            help='set choose_total_tries tunable, default do not change')
    args = parser.parse_args()
    osdmap = args.osdmap
    crushmap = args.crushmap
    pools = args.pools
    osd_df = parse_osd_df(args.osd_df)
    ceph_df = parse_ceph_df(args.ceph_df, pools=pools)
    set_initial_crush_weights(crushmap, osd_df, choose_total_tries=args.choose_total_tries)
    ## start calculations
    (weights, pg_stats) = find_optimal_osd_crush_weights(ceph_df=ceph_df, osd_df=osd_df,
            osdmap=osdmap, crushmap=crushmap, pools=pools,
            target_stddev=args.target_stddev, max_rounds=args.max_rounds,
            pg_min_max_diff=args.pg_min_max_diff)
    ## prepare final crushmap
    final_crushmap = prepare_final_crush_map(crushmap=crushmap,
            weights=weights[u'FINAL'])

    print '\nto apply run:\n\tceph osd setcrushmap -i {}\n'.format(final_crushmap)
