On Fri, Jun 2, 2017 at 4:05 PM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 05/13/17 18:55, Dan van der Ster wrote:
>> +1, this is something I've wanted for a while. Using my "gentle
>> reweight" scripts, I've found that backfilling stays pretty
>> transparent as long as we limit to <5% of OSDs backfilling on our
>> large clusters. I think it will take some experimentation to find the
>> best default ratio to ship.
>>
>> On the other hand, the *other* reason that we operators like to make
>> small changes is to limit the number of PGs that go through peering
>> all at once. Correct me if I'm wrong, but as an operator I'd hesitate
>> to trigger a re-peering of *all* PGs in an active pool -- users would
>> surely notice such an operation. Does luminous or luminous++ have some
>> improvements to this half of the problem?
>>
>> Cheers, Dan
>
> Hi Dan,
>
> I have read your script:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42
>
> And at that line I see you using "ceph osd crush reweight" instead of
> "ceph osd reweight".
>
> I just added 2 nodes to my cluster, hit some related issues, and solved
> them. Doing it like your script -- crush reweighting a tiny bit at a
> time -- causes blocked requests for long durations, even when moving
> just 1 pg ... I let one go for 40s before stopping it. It seemed
> impossible to ever get one pg to peer without such a long block. I also
> tried making a special pool on those 12 osds as a test, and it took 1
> minute to create 64 pgs with no clients using them, which is still
> unreasonable for a blocked request. (The "normal" way of blindly adding
> osds at full weight, without any special care, would do the same thing,
> just in one big jump instead of many.)
>
> The solution in the end was quite painless: have the osds up (with
> either weight at 0), then set reweight to 0 and the crush weight to its
> normal value (in TB). That triggers peering (one sort of peering?), and
> once that peering is done, change "ceph osd reweight" -- even on a bunch
> of osds at once -- and it has barely any impact. It peers again (the
> other sort of peering, not repeating the slow, terrible sort it did
> already?), but very fast and with only a few 5s blocked requests (which
> is fairly normal here due to rbd snapshots). Maybe the crush-weight
> peering with reweight 0 does the slow, terrible sort of peering, but
> without blocking any real pgs, and therefore without blocking clients,
> so it's tolerable (it only blocks empty osds and unused pools/pgs). And
> then the other peering is fast.
>
> And Sage, if that's true, then couldn't ceph by default just do the
> first kind of peering work before any pgs, pools, clients, etc. are
> affected, before moving on to the stuff that affects clients, regardless
> of which steps were used? At some point while adding those 2 nodes I was
> thinking: how could ceph be so broken and mysterious... why does it just
> hang there? Would it do this during recovery of a dead osd too? Now I
> know how to avoid it, and that it shouldn't affect recovering dead osds
> (no crush weight change there)... but it would be nice if no user ever
> had to think that way. :)
>
> And Dan, I am curious why you use crush reweight for this (which I
> failed to make work), and whether you tried it the way I describe above,
> or another way.
>
> I'm using jewel 10.2.7. I don't know how other versions behave.
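Just to check that I follow, is your sequence roughly the one below?
(Only a sketch -- osd.20, host ceph-05 and the 5.46 crush weight are
invented examples, and you could equally set "osd crush initial weight
= 0" in ceph.conf so new osds boot with crush weight 0.)

   # new osd is up but takes no data: crush weight 0 and reweight 0
   ceph osd crush add osd.20 0 root=default host=ceph-05
   ceph osd reweight 20 0

   # give it its real crush weight; the slow peering happens now,
   # while the osd is still masked out by reweight=0, so clients
   # shouldn't notice
   ceph osd crush reweight osd.20 5.46

   # once ceph -s shows peering has settled, open the tap
   ceph osd reweight 20 1.0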
Here's what we do:

1. Create and start new OSDs with initial crush weight = 0.0. No PGs
   should re-peer when these are booted.

2. Run the reweight script, e.g. like this for some 6T drives:

      ceph-gentle-reweight -o osd.10,osd.11,osd.12 -l 15 -b 50 -d 0.01 -t 5.46

In practice we've added >150 drives at once with that script -- using
that tiny delta.

We use crush reweight because it "works for us (tm)". We haven't seen
any strange peering hangs, though we exercise this on hammer, not (yet)
jewel.

I hadn't thought of your method using osd reweight -- how do you add new
osds with an initial osd reweight? Maybe you create the osds in a
non-default root and then move them after being reweighted to 0.0?
(Rough sketch of what I mean in the P.S. below.)

Cheers, Dan
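P.S. The non-default-root idea I mention above would look something like
this -- again only a sketch, with an invented "staging" root and the same
invented osd/host names as before:

   # park the new osd, at full crush weight, under a root that no
   # crush rule references, so nothing re-peers when it boots
   ceph osd crush add-bucket staging root
   ceph osd crush add osd.20 5.46 root=staging

   # mask it out, move it under its real host in the default root,
   # and only then raise the reweight
   ceph osd reweight 20 0
   ceph osd crush create-or-move osd.20 5.46 root=default host=ceph-05
   ceph osd reweight 20 1.0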