On Fri, Aug 4, 2017 at 9:20 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Fri, 4 Aug 2017, Spandan Kumar Sahu wrote:
>> On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > I think the main value in the python-crush optimize code is that it
>> > demonstrably works, which means we know that the cost/score function
>> > being used and the descent method work together. I think the best
>> > path forward is to look at the core of what those two pieces are
>> > doing and port it into the balancer environment. Most recently I've
>> > been working on the 'eval' method that will generate a score for a
>> > given distribution, but I'm working from first principles (just
>> > calculating the layout, its deviation from the target, the standard
>> > deviation, etc.) but I'm not sure what
>>
>> I had done some work on assigning a score to the distribution in
>> this [1] PR. It was, however, done in the pre-existing
>> reweight-by-utilization. Would you take a look at it and let me know
>> if I should proceed to port it into the balancer module?
>
> This seems reasonable... I'm not sure we can really tell what the best
> function is without trying it in combination with some optimization
> method, though.
>
> I just pushed a semi-complete/working eval function in the wip-balancer
> branch that uses a normalized standard deviation for pgs, objects, and
> bytes. (Normalized meaning the standard deviation is divided by the
> total count of pgs or objects or whatever so that it is unitless.) The
> final score is just the average of those three values. Pretty sure
> that's not the most sensible thing, but it's a start. FWIW I can do
>
> bin/init-ceph stop
> MON=1 OSD=8 MDS=0 ../src/vstart.sh -d -n -x -l
> bin/ceph osd pool create foo 64
> bin/ceph osd set-require-min-compat-client luminous
> bin/ceph balancer mode upmap
> bin/rados -p foo bench 10 write -b 4096 --no-cleanup
> bin/ceph balancer eval
> bin/ceph balancer optimize foo
> bin/ceph balancer eval foo
> bin/ceph balancer execute foo
> bin/ceph balancer eval
>
> and the score goes from .02 to .001 (and pgs get balanced).
>
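Just to check my reading of the eval scoring in wip-balancer before I
compare it with mine: as I understand the description above, it is
roughly the following. This is only a sketch to confirm my
understanding; the helper names and the shape of the input are made up,
and it is not the module's actual code.

    import statistics

    def key_score(count_by_osd):
        # normalized standard deviation: the stdev of the per-OSD
        # counts divided by the total count, so the value is unitless
        total = sum(count_by_osd.values())
        if total == 0:
            return 0.0
        return statistics.pstdev(count_by_osd.values()) / total

    def eval_score(pgs, objects, bytes_):
        # final score = plain average of the three per-key values
        return (key_score(pgs) + key_score(objects)
                + key_score(bytes_)) / 3.0

    # e.g. 8 OSDs with slightly uneven pg counts (made-up numbers)
    pgs = {0: 30, 1: 28, 2: 33, 3: 27, 4: 31, 5: 29, 6: 35, 7: 27}

Please correct me if I have misread it.
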
I have sent a PR for a better scoring method at
https://github.com/liewegas/ceph/pull/59.

Standard deviation is unbounded, and it depends significantly on the
'key' ('pg' or 'objects' or 'bytes'). That is not the case with the
scoring method I suggest. There are additional benefits as well:

1. The new method can distinguish between [5 overweighted + 1 heavily
   underweighted] and [5 underweighted + 1 heavily overweighted].
   (Discussed in previous mails.)
2. It gives scores in a similar range for all keys.
3. It takes into consideration only the over-weighted devices.

The need for such a scoring algorithm has been mentioned in the
comments and the commit message.

>> > Loic's optimizer was doing. Also, my first attempt at a descent
>> > function to correct weights was pretty broken, and I know a lot of
>> > experimentation went into Loic's method.
>> >
>>
>> Loic's optimizer only fixed defects in the crushmap, and was not (in
>> the true sense) a reweight-by-utilization. In short, it optimized a
>> pool on the basis of a rule and then ran a simulation to determine
>> the new weights. Using the `take` in the rules, it determined a list
>> of OSDs and moved weight (about 1% of the overload%) from one OSD to
>> another. This way, the weights of the buckets at the next
>> hierarchical level of the crush tree were not affected. I went
>> through Loic's optimizer in detail and also added my own
>> improvisations.
>>
>> I will try to port the logic, but I am not sure where the optimizer
>> would fit in. Would it go in as a separate function in module.py, or
>> would it have different implementations for each of upmap, crush,
>> and crush-compat? Loic's python-crush didn't take upmaps into
>> account, but the logic will apply in the case of upmaps too.
>
> The 'crush-compat' mode in balancer is the one to target. There is a
> partial implementation there that needs to be updated to use the new
> framework; I'll fiddle with it a bit more to make it use the new Plan
> approach (currently it makes changes directly to the cluster, which
> doesn't work well!). For now the latest is at
>
> https://github.com/ceph/ceph/pull/16272
>
> You can ignore the other modes (upmap etc.) for now. Eventually we
> could make it so that transitioning from one mode to another will
> somehow phase out the old changes, but that's complicated and not
> needed yet.
>
>> > Do you see any problems with that approach, or things that the
>> > balancer framework does not cover?
>> >
>>
>> I was hoping that we would have an optimizer that fixes the faults
>> in a crushmap whenever a crushmap is set and/or a new device gets
>> added or deleted. The current balancer would also fix it, but it
>> would take much more time and much more movement of data to achieve
>> a better distribution, compared to fixing the crushmap itself at the
>> very beginning. Nevertheless, the balancer module will eventually
>> reach a reasonably good distribution.
>>
>> Correct me if I am wrong. :)
>
> No, I think you're right. I don't expect that people will be importing
> crush maps that often, though... and if they do they are hopefully
> clever enough to do their own thing. The goal is for everything to be
> manageable via the CLI or (better yet) simply handled automatically by
> the system.
>
> I think the main thing to worry about is the specific cases that
> people are likely to encounter (and tend to complain about), like
> adding new devices and wanting the system to weight them in gradually.
>
> sage
>
>> [1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559
>>
>> > Thanks!
>> > sage
>>
>> --
>> Spandan Kumar Sahu
>> IIT Kharagpur

--
Regards
Spandan Kumar Sahu
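
P.S. For reference, the weight-moving step in Loic's optimizer that I
described above works roughly like the sketch below. This is written
from memory and is not the actual python-crush code; the names are
hypothetical, and the real optimizer re-runs a CRUSH simulation after
each step to measure the effect before taking the next one.

    def move_weight_step(weights, utilization, target, step=0.01):
        # One descent step: shift a small amount of weight from the
        # most overfull OSD to the most underfull OSD among the OSDs
        # reached by the same `take`, so the parent bucket's total
        # weight (and everything above it in the crush tree) is left
        # unchanged.
        overload = {osd: utilization[osd] - target[osd]
                    for osd in weights}
        worst = max(overload, key=overload.get)   # most overfull OSD
        best = min(overload, key=overload.get)    # most underfull OSD
        if overload[worst] <= 0:
            return weights                        # nothing is overfull
        # move roughly 1% of the overload from `worst` to `best`
        delta = step * overload[worst] * weights[worst]
        weights[worst] -= delta
        weights[best] += delta
        return weights

Because weight only moves between OSDs under the same `take`, the sum
of the children's weights is preserved, which is why the buckets at the
next hierarchical level are unaffected.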