Re: global backfill reservation?

"LIU, Fei" <james.liu@xxxxxxxxxxxxxxx> · Sat, 03 Jun 2017 05:44:33 +0800

I agree with what Ning said about what Ceph cluster users expect. Recovery, backfill, and even scrub should be scheduled dynamically based on the SLA and the available cluster resources.

Regards,
James

On 5/20/17, 7:24 AM, "Ning Yao" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of zay11022@xxxxxxxxx> wrote:

    I think the most efficient way to solve this problem is not to
    restrict the number of backfilling PGs.  Operators reduce the number
    of PGs backfilling at the same time only because that is the only
    thing we can do in Ceph currently. As David mentioned above, reducing
    the number of PGs actively backfilling at a time will increase the
    total recovery time, which in turn lowers reliability and increases
    the probability of data loss.

    Actually, end users do not care what happens in the Ceph
    backend. They want their data recovered as fast as possible whenever
    there is enough bandwidth, but they also want user IO to be
    served first. For example, suppose the cluster has 10GB/s, 100k IOPS
    of IO bandwidth: at night, user IO takes 20% of it, leaving 80% for
    recovery, while during the day user IO takes 80%, leaving 20% for
    recovery. So it seems pretty reasonable to handle this with a dynamic
    QoS strategy that always serves user IO first. Only in this way can
    we achieve the final goal for this issue.
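
To make the arithmetic in the paragraph above concrete, here is a minimal sketch of a "client IO first" split. The function name and the MB/s units are assumptions made for illustration only; this is not an existing Ceph interface, and a real QoS scheduler would work on queued operations rather than on a single bandwidth figure.

```python
def recovery_budget(total_mb_s: int, client_demand_mb_s: int) -> int:
    """Bandwidth left for recovery once client IO has been served first.

    Hypothetical helper: both arguments are in MB/s and are assumed to be
    measured elsewhere (for example over a short sliding window).
    """
    if total_mb_s < 0 or client_demand_mb_s < 0:
        raise ValueError("bandwidth values must be non-negative")
    # Client IO is served first, up to the cluster's total capacity.
    served_client = min(client_demand_mb_s, total_mb_s)
    # Whatever remains can be handed to recovery/backfill.
    return total_mb_s - served_client


# The example from the message: a 10GB/s cluster.
night = recovery_budget(10_000, 2_000)   # user IO at 20% of capacity
day = recovery_budget(10_000, 8_000)     # user IO at 80% of capacity
print(night, day)
```

With these inputs recovery gets 8000 MB/s at night and 2000 MB/s during the day, and it drops to zero if client demand alone saturates the cluster.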

    Therefore
    Regards
    Ning Yao

    2017-05-13 2:53 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
    > A common complaint is that recovery/backfill/rebalancing has a high
    > impact.  That isn't news.  What I realized this week after hearing more
    > operators describe their workaround is that everybody's workaround is
    > roughly the same: make small changes to the crush map so that only a small
    > number of PGs are backfilling at a time.  In retrospect it seems obvious,
    > but the problem is that our backfill throttling is per-OSD: the "slowest"
    > we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one
    > replica due to separate reservation thresholds to avoid deadlock.)  That
    > means that every OSD is impacted.  Doing fewer PGs doesn't make the
    > recovery vs client scheduling better, but it means it affects fewer PGs
    > and fewer client IOs and the net observed impact is smaller.
    >
    > Anyway, in short, I think we need to be able to set a *global* threshold
    > of "no more than X % of OSDs should be backfilling at a time," which is
    > impossible given the current reservation approach.
    >
    > This could be done naively by having OSDs reserve a slot via the mon or
    > mgr.  If we only did it for backfill the impact should be minimal (those
    > are big slow long-running operations already).
    >
    > I think you can *almost* do it cleverly by inferring the set of PGs that
    > have to backfill by pg_temp.  However, that doesn't take any priority or
    > stuck PGs into consideration.
    >
    > Anyway, the naive thing probably isn't so bad...
    >
    > 1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with
    > one or more backfilling PGs).
    >
    > 2) For the first step of the backfill (recovery?) reservation, OSDs ask
    > the mgr for a reservation slot.  The reservation is (pgid,interval epoch)
    > so that the mgr can throw out the reservation request without needing an
    > explicit cancellation if there is an interval change.
    >
    > 3) mgr grants as many reservations as it can without (backfilling +
    > grants) > whatever the max is.
    >
    > We can set the max with a global tunable like
    >
    >  max_osd_backfilling_ratio = .3
    >
    > so that only 30% of the osds can be backfilling at once?
    >
    > sage
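
The three steps in the proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not mgr code, the set of backfilling OSDs stands in for what PGMap would report, and letting an OSD that is already counted take further grants is my own simplification (the existing per-OSD limits would still apply).

```python
class BackfillReservations:
    """Hypothetical mgr-side bookkeeping for a global backfill cap."""

    def __init__(self, num_osds, max_osd_backfilling_ratio):
        # Step 3's "max": how many OSDs may be backfilling at once.
        self.max_osds = int(num_osds * max_osd_backfilling_ratio)
        # Step 1: OSDs with one or more backfilling PGs (from PGMap).
        self.backfilling = set()
        # Step 2: grants keyed by (pgid, interval epoch) -> requesting OSD.
        self.grants = {}

    def busy_osds(self):
        # OSDs counted against the cap: already backfilling, or granted.
        return self.backfilling | set(self.grants.values())

    def request(self, osd, pgid, interval_epoch):
        """Return True if the reservation is granted."""
        key = (pgid, interval_epoch)
        if key in self.grants:
            return True
        busy = self.busy_osds()
        # Refuse only when granting would add a new OSD beyond the cap.
        if osd not in busy and len(busy) >= self.max_osds:
            return False
        self.grants[key] = osd
        return True

    def on_interval_change(self, pgid, new_epoch):
        # No explicit cancellation needed: grants from older intervals
        # for this PG are simply thrown out.
        self.grants = {
            key: osd
            for key, osd in self.grants.items()
            if not (key[0] == pgid and key[1] < new_epoch)
        }


res = BackfillReservations(num_osds=4, max_osd_backfilling_ratio=0.5)
res.backfilling.add(0)                 # PGMap says osd.0 is backfilling
first = res.request(1, "1.a", 10)      # second busy OSD, within the cap
second = res.request(2, "1.b", 10)     # would be a third busy OSD
res.on_interval_change("1.a", 11)      # stale grant for 1.a is dropped
third = res.request(2, "1.b", 10)      # a slot is free again
print(first, second, third)
```

The point of keying grants by (pgid, interval epoch) shows up in the last three lines: once the interval changes, the old grant disappears on its own and the slot becomes available to another OSD.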
    > --
    > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
    > the body of a message to majordomo@xxxxxxxxxxxxxxx
    > More majordomo info at  http://vger.kernel.org/majordomo-info.html