Hi Xavier,

On 05/11/2017 07:52 AM, Xavier Villaneau wrote:
> Hello Sage,
>
> On 05/10/2017 at 09:00 AM, Sage Weil wrote:
>> I *think* this is the simplest way to approach this (at least
>> currently), although it is not in fact perfect. We basically assume
>> that the crush rules the admin has set up "make sense." For example,
>> if you have two crush rules that target a subset of the hierarchy,
>> but they are overlapping (e.g., one is a subset of the other), then
>> the subset that is covered by both will get utilized by the sum of
>> the two and have a higher utilization--and the optimizer will not
>> care (in fact, it will expect it).
>>
>> That rules out at least one potential use-case, though: say you have
>> a pool and rule defined that target a single host. Depending on how
>> much you store in that pool, those devices will be that much more
>> utilized. One could imagine wanting Ceph to automatically monitor
>> that pool's utilization (directly or indirectly) and push other
>> pools' data out of those devices as the host-local pool fills. I
>> don't really like this scenario, though, so I can't tell if it is a
>> "valid" one we should care about.
>
> It looks like the upmap calculation works by counting placement
> groups, so "weird" maps and rules are mostly a problem if the
> overlapping pools have different ratios of bytes per PG. Maybe that
> data could be used in the algorithm, but I don't know whether the
> added complexity would be worth it. At this point, it is probably
> fair to assume those corner cases are only seen on maps created by
> knowledgeable users.
>
>> In any case, my hope is that at the end of the day we have a suite
>> of optimization mechanisms: crush weights via the new choose_args,
>> pg-upmap, and (if we don't deprecate it entirely) the osd reweight;
>> and pg-based or osd utilization-based optimization (expected usage
>> vs actual usage, or however you want to put it). Ideally, they could
>> use a common setup framework that handles the calculation of the
>> expected/optimal targets we're optimizing against (using something
>> like the above) so that it isn't reinvented/reimplemented (or, more
>> likely, not!) for each one.
>
> I like pg-upmap: it looks great for fine control and has a small
> impact (compared to manipulating weights, which could move thousands
> of PGs around). This makes it ideal for re-balancing running
> clusters.
>
> The weight manipulation is done in small increments. I believe each
> increment can be applied individually to throttle the PG movements
> (i.e. a crushmap with slightly modified weights can be produced and
> uploaded at each step). It is not as fine-grained as pg-upmap, but it
> does not need to be an all-or-nothing optimization either.
>
> pg-upmap also addresses the "weight spike" issue (the 5 1 1 1 1 case)
> since it can just keep moving placement groups until it runs out of
> options. This allows the theoretical limit cases to be reached
> eventually, whereas with weights the convergence is only asymptotic.
>
> Hopefully that won't be too confusing; reweight is already quite
> difficult to explain to new users, so pg-upmap probably will be too.
> There is also the question of how those tools interact with each
> other, for instance if PG-based optimization is run on top of
> utilization-based optimization.
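Regarding "the upmap calculation works by counting placement groups":
below is a minimal sketch of how I picture such a balancer, just to
check that we mean the same thing. It is not the actual implementation;
the PG-to-OSD mapping, the greedy fullest-to-emptiest selection and the
stopping rule are assumptions on my part. Each proposed move would then
translate into one pg-upmap-items command.

#!/usr/bin/env python3
# Minimal sketch (not the actual implementation): choose upmap moves by
# counting PGs per OSD and shifting one PG at a time from the fullest
# to the emptiest OSD, until the spread is within one PG or no
# candidate move remains.
from collections import Counter

def propose_upmaps(pg_to_osds):
    """pg_to_osds: dict mapping pgid -> list of OSD ids (the up set)."""
    counts = Counter(osd for osds in pg_to_osds.values() for osd in osds)
    moves = []
    while True:
        full = max(counts, key=counts.get)
        empty = min(counts, key=counts.get)
        if counts[full] - counts[empty] <= 1:
            break
        # find a PG on the full OSD that does not already use the empty one
        for pgid, osds in pg_to_osds.items():
            if full in osds and empty not in osds:
                osds[osds.index(full)] = empty
                counts[full] -= 1
                counts[empty] += 1
                moves.append((pgid, full, empty))
                break
        else:
            break  # no legal move left for this pair, stop
    return moves

# Each proposed move would map to one CLI call, e.g.:
#   ceph osd pg-upmap-items <pgid> <from-osd> <to-osd>
for pgid, src, dst in propose_upmaps({'1.0': [0, 1], '1.1': [0, 2], '1.2': [0, 3]}):
    print('ceph osd pg-upmap-items %s %d %d' % (pgid, src, dst))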
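And on the 5 1 1 1 1 "weight spike": here is a toy simulation (weighted
sampling without replacement, not CRUSH itself, and the host names are
made up) that shows the ceiling involved. With 3 replicas the big host
cannot take more than one replica per PG, so reweighting can only
approach that limit asymptotically, while pg-upmap moves whole PGs and
either reaches the target or runs out of options.

#!/usr/bin/env python3
# Toy illustration of the "5 1 1 1 1" case: the big host can appear at
# most once per PG, so its replica share is capped near 1/3 with 3
# replicas, well below the 5/9 its weight asks for; the excess spills
# onto the other hosts.
import random

weights = {'a': 5.0, 'b': 1.0, 'c': 1.0, 'd': 1.0, 'e': 1.0}

def place_pg(weights, replicas=3):
    remaining = dict(weights)
    chosen = []
    for _ in range(replicas):
        hosts, w = zip(*remaining.items())
        pick = random.choices(hosts, weights=w)[0]  # Python 3.6+
        chosen.append(pick)
        del remaining[pick]
    return chosen

num_pgs = 100000
counts = dict.fromkeys(weights, 0)
for _ in range(num_pgs):
    for host in place_pg(weights):
        counts[host] += 1

total = 3.0 * num_pgs
for host in sorted(weights):
    print('%s: weight share %.3f, replica share %.3f'
          % (host, weights[host] / sum(weights.values()), counts[host] / total))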
> (Hopefully I did not say anything too wrong; I haven't been able to
> follow those developments closely for the past couple of weeks.)
>
> Regards,

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre