Re: Backfill stops after a while after OSD reweight

Your CRUSH rules will not change automatically.
Check out the documentation for changing tunables:

http://docs.ceph.com/docs/mimic/rados/operations/crush-map/#tunables
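
For example, roughly (just a sketch -- double-check against the docs for your
release before running anything, and expect data movement from both steps):

    # switch to the jewel tunables profile
    ceph osd crush tunables jewel
    # convert existing straw buckets to straw2
    ceph osd crush set-all-straw-buckets-to-straw2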

2018-06-20 18:27 GMT+02:00 Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>:
Thanks, Paul - I could probably activate the Jewel tunables
profile without losing too many clients - most are running
at least kernel 4.2, I think. I'll go hunting for older
clients ...

After changing the tunables, do I need to restart any
Ceph daemons?

Another question, if I may: The hammer tunables bring
CRUSH_V4 with straw2 buckets. Can I / should I convert
the existing buckets to straw2 somehow? Or will it
happen automatically?


Cheers,

Oliver




On 20.06.2018 18:10, Paul Emmerich wrote:
Yeah, your tunables are ancient. This probably wouldn't have happened with modern ones.
If this were my cluster I would probably update the clients and then the tunables (caution: lots of data movement!),
but I know how annoying it can be to chase down everyone who runs ancient clients.

For comparison, this is what a fresh installation of Luminous looks like:
{
     "choose_local_tries": 0,
     "choose_local_fallback_tries": 0,
     "choose_total_tries": 50,
     "chooseleaf_descend_once": 1,
     "chooseleaf_vary_r": 1,
     "chooseleaf_stable": 1,
     "straw_calc_version": 1,
     "allowed_bucket_algs": 54,
     "profile": "jewel",
     "optimal_tunables": 1,
     "legacy_tunables": 0,
     "minimum_required_version": "jewel",
     "require_feature_tunables": 1,
     "require_feature_tunables2": 1,
     "has_v2_rules": 1,
     "require_feature_tunables3": 1,
     "has_v3_rules": 0,
     "has_v4_buckets": 1,
     "require_feature_tunables5": 1,
     "has_v5_rules": 0
}


For a workaround/fix, I'd first figure out which tunables can be adjusted
without breaking the oldest clients. Incrementing the choose*tries values in the CRUSH rule
or in the tunables is probably sufficient.
But since you are apparently running into data-balance problems, you'll have
to update the tunables to something more modern sooner or later.

You can also play around with crushtool: it can simulate how PGs are mapped,
which is usually better than changing random things on a production cluster:
http://docs.ceph.com/docs/mimic/man/8/crushtool/
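
A rough sketch of an offline test run (file names are just placeholders):

    # grab the current CRUSH map and check the mappings offline
    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --num-rep 3 --show-bad-mappings

    # decompile, edit rules/tunables, recompile, then re-test before injecting
    crushtool -d crushmap.bin -o crushmap.txt
    crushtool -c crushmap.txt -o crushmap-new.bin
    crushtool -i crushmap-new.bin --test --num-rep 3 --show-bad-mappings

If --show-bad-mappings prints nothing, CRUSH found a valid mapping for every
input it tried.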

Good luck


Paul

2018-06-20 17:57 GMT+02:00 Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>:


    Hi Paul,

    ah, right, "ceph pg dump | grep remapped", that's what I was looking
    for. I added the output and the result of the pg query at the end of

    https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb


    > But my guess here is that you are running a CRUSH rule to distribute across 3 racks
    > and you only have 3 racks in total.

    Yes - I always assumed that 3 failure domains would be suitable
    for a replication factor of 3. The three racks are absolutely
    identical, though, hardware-wise, including HDD sizes, and we
    never had any trouble like this before Luminous (we often used
    significant reweighting in the past).

    We are way behind on Ceph tunables though:

    # ceph osd crush show-tunables
    {
         "choose_local_tries": 0,
         "choose_local_fallback_tries": 0,
         "choose_total_tries": 50,
         "chooseleaf_descend_once": 1,
         "chooseleaf_vary_r": 0,
         "chooseleaf_stable": 0,
         "straw_calc_version": 1,
         "allowed_bucket_algs": 22,
         "profile": "bobtail",
         "optimal_tunables": 0,
         "legacy_tunables": 0,
         "minimum_required_version": "bobtail",
         "require_feature_tunables": 1,
         "require_feature_tunables2": 1,
         "has_v2_rules": 0,
         "require_feature_tunables3": 0,
         "has_v3_rules": 0,
         "has_v4_buckets": 0,
         "require_feature_tunables5": 0,
         "has_v5_rules": 0
    }

    We still have some old clients (trying to get rid of those, so I
    can activate more recent tunables, but it may be a while) ...

    Are my tunables at fault? If so, can you recommend a solution
    or a temporary workaround?


    Cheers (and thanks for helping!),

    Oliver




    On 06/20/2018 05:01 PM, Paul Emmerich wrote:

        Hi,

        have a look at "ceph pg dump" to see which ones are stuck in
        remapped.

        But my guess here is that you are running a CRUSH rule to
        distribute across 3 racks
        and you only have 3 racks in total.
        CRUSH will sometimes fail to find a mapping in this scenario.
        There are a few parameters
        that you can tune in your CRUSH rule to increase the number of
        retries.
        For example, the settings set_chooseleaf_tries and
        set_choose_tries can help; they are set by default for erasure
        coding rules (where this scenario is more common). The values
        used for EC are set_chooseleaf_tries = 5 and set_choose_tries = 100.
        You can configure them by adding them as the first steps of the
        rule.
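        A sketch of what that could look like in a decompiled
        replicated rule (the rule name and id are just placeholders):

            rule replicated_racks {
                id 1
                type replicated
                min_size 1
                max_size 10
                # retry harder before giving up on a mapping
                step set_chooseleaf_tries 5
                step set_choose_tries 100
                step take default
                step chooseleaf firstn 0 type rack
                step emit
            }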

        You can also configure an upmap exception.
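        Roughly (the PG and OSD ids are made up, and note this requires
        all clients to be at least luminous, which may not hold for you):

            ceph osd set-require-min-compat-client luminous
            # map PG 1.2f away from osd.12 onto osd.34
            ceph osd pg-upmap-items 1.2f 12 34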

        But in general it is often not the best idea to have only 3
        racks for replica = 3 if you want
        to achieve a good data balance.



        Paul


        2018-06-20 16:50 GMT+02:00 Oliver Schulz
        <oliver.schulz@xxxxxxxxxxxxxx>:

             Dear Paul,

             thanks, here goes (output of "ceph -s", etc.):

        https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb

             > Also please run "ceph pg X.YZ query" on one of the PGs
        not backfilling.

             Silly question: How do I get a list of the PGs not backfilling?



             On 06/20/2018 04:00 PM, Paul Emmerich wrote:

                  Can you post the full output of "ceph -s", "ceph health
                  detail", and "ceph osd df tree"?
                 Also please run "ceph pg X.YZ query" on one of the PGs
        not backfilling.


                 Paul

                  2018-06-20 15:25 GMT+02:00 Oliver Schulz
                  <oliver.schulz@xxxxxxxxxxxxxx>:

                      Dear all,

                      we (somewhat) recently extended our Ceph cluster,
                      and updated it to Luminous. By now, the fill level
                       on some OSDs is quite high again, so I'd like to
                      re-balance via "OSD reweight".

                      I'm running into the following problem, however:
                       No matter what I do (reweight a little, or a lot,
                      or only reweight a single OSD by 5%) - after a
                      while, backfilling simply stops and lots of objects
                      stay misplaced.

                      I do have up to 250 PGs per OSD (early sins from
                      the first days of the cluster), but I've set
                      "mon_max_pg_per_osd = 400" and
                      "osd_max_pg_per_osd_hard_ratio = 1.5" to compensate.

                      How can I find out why backfill stops? Any advice
                      would be very much appreciated.


                      Cheers,

                      Oliver
                      _______________________________________________
                      ceph-users mailing list
        ceph-users@xxxxxxxxxxxxxx
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




                  --
                  Paul Emmerich

                 Looking for help with your Ceph cluster? Contact us at
        https://croit.io

                 croit GmbH
                 Freseniusstr. 31h
                 81247 München
        www.croit.io
                 Tel: +49 89 1896585 90





        --
        Paul Emmerich

        Looking for help with your Ceph cluster? Contact us at
        https://croit.io

        croit GmbH
        Freseniusstr. 31h
        81247 München
        www.croit.io
        Tel: +49 89 1896585 90





--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
