Re: global backfill reservation?

On Fri, Jun 2, 2017 at 4:05 PM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 05/13/17 18:55, Dan van der Ster wrote:
>> +1, this is something I've wanted for awhile. Using my "gentle
>> reweight" scripts, I've found that backfilling stays pretty
>> transparent as long as we limit to <5% of OSDs backfilling on our
>> large clusters. I think it will take some experimentation to find the
>> best default ratio to ship.
>>
>> On the other hand, the *other* reason that we operators like to make
>> small changes is to limit the number of PGs that go through peering
>> all at once. Correct me if I'm wrong, but as an operator I'd hesitate
>> to trigger a re-peering of *all* PGs in an active pool -- users would
>> surely notice such an operation. Does luminous or luminous++ have some
>> improvements to this half of the problem?
>>
>> Cheers, Dan
>>
>
> Hi Dan,
>
> I have read your script:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42
>
> And at that line I see you using "ceph osd crush reweight" instead of
> "ceph osd reweight".
>
> And I just added 2 nodes to my cluster, ran into some related issues,
> and solved them. Doing it the way your script does, crush reweighting a
> tiny bit at a time caused blocked requests for long durations, even
> with just 1 pg moving ... I let one go for 40s before stopping it. It
> seemed impossible to get even one pg to peer without such a long block.
> I also tried making a test pool on just those 12 osds, and it took 1
> minute to create 64 pgs with no clients using them, which is still an
> unreasonable length of time for a request to be blocked. (Also, the
> "normal" way of blindly adding osds at full weight without taking any
> special care would do the same thing, just in one big jump instead of
> many.)
>
> And the solution in the end was quite painless... have the osds up
> (with either weight at 0), then set reweight to 0 and the crush weight
> to its normal value (in TB). That triggers peering (one sort of
> peering?). Then, after that peering is done, change "ceph osd
> reweight", even on a bunch of osds at once, and it has barely any
> impact: it peers again (the other sort of peering, not a repeat of the
> slow, terrible sort it already did?), but very fast and with only a few
> 5s blocked requests (which is fairly normal here due to rbd snapshots).
> Maybe the crush-weight peering with reweight 0 still does the slow,
> terrible sort of peering, but without touching any real pgs, and
> therefore without blocking clients, so it's tolerable (it only blocks
> empty osds and unused pools and pgs). And then the other peering is
> fast.
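>
> For concreteness, the sequence was roughly this (osd.20 and the 5.46
> crush weight are just placeholder values):
>
>    # the new osd is up but, with reweight 0, takes no data
>    ceph osd reweight 20 0
>    # set the real crush weight; the slow peering happens here, but it
>    # only involves the empty osds, so clients are not blocked
>    ceph osd crush reweight osd.20 5.46
>    # once that peering settles, let data flow in; this re-peering is
>    # fast
>    ceph osd reweight 20 1.0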
>
> And Sage, if that's true, couldn't ceph by default do that first kind
> of peering work before any pgs, pools, clients, etc. are affected, and
> only then move on to the stuff that does affect clients, regardless of
> which steps the operator used? At some point while adding those 2 nodes
> I was wondering how ceph could be so broken and mysterious... why does
> it just hang there? Would it do this during recovery of a dead osd too?
> Now I know how to avoid it, and that it shouldn't affect recovery of
> dead osds (since that doesn't change crush weights)... but it would be
> nice if no user ever had to think that way. :)
>
> And Dan, I'm curious why you use crush reweight for this (which didn't
> work for me), and whether you have tried it the way I describe above,
> or some other way.
>
> And I'm using jewel 10.2.7. I don't know how other versions behave.
>
>

Here's what we do:
  1. Create and start new OSDs with initial crush weight = 0.0. No PGs
should re-peer when these are booted.
  2. Run the reweight script, e.g. like this for some 6T drives:

   ceph-gentle-reweight -o osd.10,osd.11,osd.12 -l 15 -b 50 -d 0.01 -t 5.46

In practice we've added >150 drives at once with that script -- using
that tiny delta.
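
For step 1, here is a minimal sketch of one way to get the zero initial
crush weight (this is the generic ceph config option, not necessarily
exactly what our deployment tooling does):

   # ceph.conf on the new OSD hosts, in place before the OSDs are
   # created, so they register in the crush map with weight 0 instead
   # of their size-based weight
   [osd]
   osd crush initial weight = 0

New osds then show up in "ceph osd tree" with crush weight 0 and take no
data until the reweight script walks them up to the target weight (-t).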

We use crush reweight because it "works for us (tm)". We haven't seen
any strange peering hangs, though we exercise this on hammer, not
(yet) jewel.
I hadn't thought of your method using osd reweight -- how do you add
new osds with an initial osd reweight of 0? Maybe you create the osds in
a non-default root and then move them into place after they've been
reweighted to 0.0?
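
If it's something like that, here is a rough sketch of what I'm
imagining -- the "staging" root, the "newhost" bucket and osd.20 are
all invented names:

   # a root that no crush rule references, so nothing maps to it
   ceph osd crush add-bucket staging root
   # park the new host bucket (with its freshly created osds) under it
   ceph osd crush move newhost root=staging
   # zero the osd reweight before exposing the osds to data placement
   ceph osd reweight 20 0
   # later, move the host under the default root; crush now considers
   # the osds, but with reweight 0 they still take no data
   ceph osd crush move newhost root=default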

Cheers, Dan


