Re: Speeding up backfill after increasing PGs and or adding OSDs

David Turner <drakonstein@xxxxxxxxx> · Thu, 06 Jul 2017 15:08:20 +0000

A quick place to start is osd_max_backfills, which you have set to 1.  With EC 8+3, each PG is on 11 OSDs.  While a PG is moving, it involves the original 11 OSDs plus the X new OSDs that it is moving to.  Each OSD can take part in only 1 backfill at a time (your osd_max_backfills), so every PG that is moving ties up 11 + X OSDs.
Now for your cluster.  I don't see how many OSDs you have, but you have 25 PGs moving around and 8 of them are actively backfilling.  Assuming only 1 OSD changes per backfill operation, you would need at least 96 OSDs ((11 + 1) * 8), and that assumes a perfect distribution of OSDs across the backfilling PGs.  Now let's say that you're averaging closer to 3 OSDs changing per PG, and that the remaining 17 PGs waiting to backfill are each blocked by a few OSDs (because those OSDs are already part of the 8 actively backfilling PGs).  That would indicate that you have closer to 200+ OSDs.
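
As a rough check of the arithmetic above, here is a minimal Python sketch. The function name and its parameters are made up for illustration and are not part of Ceph. It only computes the lower bound implied by the actively backfilling PGs; it does not model the waiting PGs, which is where the 200+ estimate comes from.

```python
import math


def min_osds_for_backfills(shards_per_pg, new_osds_per_pg, active_pgs, max_backfills=1):
    """Lower bound on cluster size implied by the number of concurrent backfills.

    Assumes each backfilling PG occupies its existing shards plus the OSDs it
    is moving to, and that each OSD offers `max_backfills` backfill slots.
    """
    slots_needed = (shards_per_pg + new_osds_per_pg) * active_pgs
    return math.ceil(slots_needed / max_backfills)


# Figures from this thread: EC 8+3 gives 11 shards per PG.
backfilling = 5 + 3  # PG states ending in "backfilling" in the status output
waiting = 13 + 4     # PG states ending in "backfill_wait"

print(backfilling, waiting, backfilling + waiting)  # 8 17 25
print(min_osds_for_backfills(11, 1, backfilling))   # 96
print(min_osds_for_backfills(11, 3, backfilling))   # 112
```

With 3 OSDs changing per PG, the active backfills alone account for 112 OSDs; the OSDs blocking the 17 waiting PGs come on top of that.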

Every time I'm backfilling and want to speed things up, I watch iostat on some of my OSDs and increase osd_max_backfills until the disks are consistently about 70% utilized, which leaves overhead for customer I/O.  You can always figure out what's best for your use case though.  Generally I've been OK running with osd_max_backfills=5 without much problem, bringing it up further when I know that client I/O will be minimal, but again it depends on your use case and cluster.
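
The manual loop described above could be sketched roughly like this in Python. It is only an illustration, not a tool I run: the helper names, the 10-point margin, and the ceiling of 10 are assumptions, and the utilization samples stand in for the %util column you would read from `iostat -x`. The command string uses `ceph tell osd.* injectargs`; check that this is still the right way to change the setting at runtime on your release.

```python
from statistics import mean


def next_max_backfills(current, util_samples, target=70.0, margin=10.0, ceiling=10):
    """Suggest the next osd_max_backfills value from disk utilization samples.

    util_samples: %util readings (0-100) taken from a few OSD disks.
    Steps up by 1 while there is clear headroom below the target, steps down
    by 1 when the disks are busier than the target, otherwise holds steady.
    """
    avg = mean(util_samples)
    if avg < target - margin:
        return min(current + 1, ceiling)
    if avg > target:
        return max(current - 1, 1)
    return current


def injectargs_command(value):
    """Build the command that applies the new value to all OSDs at runtime."""
    return f"ceph tell osd.* injectargs '--osd-max-backfills {value}'"


# Hypothetical readings: disks are mostly idle at osd_max_backfills=1.
suggested = next_max_backfills(1, [35.0, 42.0, 38.0])
print(suggested)                      # 2
print(injectargs_command(suggested))
```

Changing the value one step at a time and re-checking iostat between steps keeps the impact on client I/O easy to see and easy to undo.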

On Thu, Jul 6, 2017 at 10:08 AM <george.vasilakakos@xxxxxxxxxx> wrote:
Hey folks,

We have a cluster that's currently backfilling after increasing PG counts. We have tuned recovery and backfill way down as a "precaution" and would like to start tuning them back up to strike a good balance between recovery and client I/O.

At the moment we're in the process of bumping up PG numbers for pools serving production workloads. Said pools are EC 8+3.

It looks like very few PGs are backfilling at any one time, as in:

            2567 TB used, 5062 TB / 7630 TB avail
            145588/849529410 objects degraded (0.017%)
            5177689/849529410 objects misplaced (0.609%)
                7309 active+clean
                  23 active+clean+scrubbing
                  18 active+clean+scrubbing+deep
                  13 active+remapped+backfill_wait
                   5 active+undersized+degraded+remapped+backfilling
                   4 active+undersized+degraded+remapped+backfill_wait
                   3 active+remapped+backfilling
                   1 active+clean+inconsistent
recovery io 1966 MB/s, 96 objects/s
  client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr

Also, the rate of recovery in terms of data and object throughput varies a lot, even with the number of PGs backfilling remaining constant.

Here's the config in the OSDs:

    "osd_max_backfills": "1",
    "osd_min_recovery_priority": "0",
    "osd_backfill_full_ratio": "0.85",
    "osd_backfill_retry_interval": "10",
    "osd_allow_recovery_below_min_size": "true",
    "osd_recovery_threads": "1",
    "osd_backfill_scan_min": "16",
    "osd_backfill_scan_max": "64",
    "osd_recovery_thread_timeout": "30",
    "osd_recovery_thread_suicide_timeout": "300",
    "osd_recovery_sleep": "0",
    "osd_recovery_delay_start": "0",
    "osd_recovery_max_active": "5",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "64000",
    "osd_recovery_forget_lost_objects": "false",
    "osd_scrub_during_recovery": "false",
    "osd_kill_backfill_at": "0",
    "osd_debug_skip_full_check_in_backfill_reservation": "false",
    "osd_debug_reject_backfill_probability": "0",
    "osd_recovery_op_priority": "5",
    "osd_recovery_priority": "5",
    "osd_recovery_cost": "20971520",
    "osd_recovery_op_warn_multiple": "16",

What I'm looking for, first of all, is a better understanding of the mechanism that schedules the backfilling/recovery work; the end goal is to understand how to tune this safely to get as close as possible to an optimal balance between the rates at which recovery and client work are performed.

I'm thinking things like osd_max_backfills, osd_backfill_scan_min/osd_backfill_scan_max might be prime candidates for tuning.

Any thoughts/insights from the Ceph community will be greatly appreciated,

George

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com