Re: throttling reweight-by-utilization

On Mon, Jan 18, 2016 at 3:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 18 Jan 2016, Dan van der Ster wrote:
>> Hi,
>>
>> I'd like to propose a few changes to reweight-by-utilization which
>> will make it significantly less scary:
>>
>> 1. Change reweight-by-utilization to run in "dry run" -- display only
>> -- mode unless an admin runs with  --yes-i-really-really-mean-it. This
>> way admins can see what will be reweighted before committing to any
>> changes.
>
> I think this piece is key, and there is a lot we might do here to make
> this more informative.  In particular, we have the (approx) sizes of each
> PG(*) and can calculate their mapping after the proposed change, which
> means we could show the min/max utilization, standard deviation, and/or
> number of nearfull or full OSDs before and after.
>

I hadn't thought of that, but it would be cool.
My main use of the dry-run functionality is to check that it's not
going to change too many OSDs at once (IOW, it lets me try different
oload and pool values before committing). Maybe some users want to make
a large change all in one go -- in that case this summary would be useful.
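Roughly, I imagine the before/after summary could look something like
this (Python sketch only; the helper and the thresholds are illustrative,
none of it is taken from OSDMonitor.cc):

    # Toy sketch: summarize per-OSD utilizations (used/total) before and
    # after a proposed reweight. nearfull/full thresholds are hypothetical.
    def summarize(utils, nearfull=0.85, full=0.95):
        n = len(utils)
        mean = sum(utils) / n
        stddev = (sum((u - mean) ** 2 for u in utils) / n) ** 0.5
        return (min(utils), max(utils), stddev,
                sum(1 for u in utils if u >= nearfull),
                sum(1 for u in utils if u >= full))

    def dry_run_report(before, after):
        for label, utils in (("before", before), ("after", after)):
            print("%-6s min %.2f max %.2f stddev %.3f nearfull %d full %d"
                  % ((label,) + summarize(utils)))

    # e.g. dry_run_report([0.62, 0.91, 0.55, 0.88], [0.70, 0.82, 0.66, 0.79])

Printing that for the current map and for the proposed map would make it
much easier to judge whether a run is worth committing.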


> * Almost... we don't really know how many bytes of key/value omap data are
> consumed.  So we could either go by the user data accounting, which is a
> lower bound, or average the OSD utilization by the PGs it stores
> (averaging pools together), or try to do the same for just the difference
> (which would presumably be omap data + overall overhead).
>
> I'm not sure how much it is worth trying to be accurate here...
>
>> 2. Add a configurable to limit the number of OSDs changed per execution:
>>   mon_reweight_max_osds_changed (default 4)
>>
>> 3. Add a configurable to limit the weight changed per OSD:
>>   mon_reweight_max_weight_change (default 0.05)
>>
>> Along with (2) and (3), the main loop in reweight_by_utilization:
>>   https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L568
>> needs to sort the OSDs by utilization.
>>
>> 4. Make adjusting weights up optional with a new CLI option
>> --adjust-up. This is useful because if you have nearly full OSDs you
>> want to prioritize making space on those OSDs.
>
> These sound reasonable to me.  Although, in general, if we ultimately want
> people to do this regularly via cron or something, we'll need --adjust-up.
> I wonder if there is some other way it should be biased so that we weight
> the overfull stuff down before weighting the underfull stuff up.  Maybe
> the max_osds_changed already mostly does that by doing the fullest osds
> first?
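Right -- if the loop sorts by utilization (fullest first) and stops
after mon_reweight_max_osds_changed changes, the overfull OSDs naturally
get handled before anything is weighted up. For (2) and (3), a rough
Python sketch of the throttled loop (config names are from the proposal;
the data structures and oload handling are simplified for illustration,
this is not the OSDMonitor.cc code):

    # osd_util/osd_weight: dicts of osd id -> utilization (used/total)
    # and current reweight. oload=1.2 means "20% over average".
    MAX_OSDS_CHANGED = 4      # mon_reweight_max_osds_changed
    MAX_WEIGHT_CHANGE = 0.05  # mon_reweight_max_weight_change

    def propose_changes(osd_util, osd_weight, oload=1.2, adjust_up=False):
        avg = sum(osd_util.values()) / float(len(osd_util))
        changes = []
        # fullest OSDs first, so the per-run cap favours making space
        for osd in sorted(osd_util, key=osd_util.get, reverse=True):
            util, weight = osd_util[osd], osd_weight[osd]
            if util > avg * oload:
                new = weight * avg / util        # weight down
            elif adjust_up and util < avg / oload:
                new = weight * avg / util        # weight up
            else:
                continue
            # clamp the per-OSD change, then respect the per-run cap
            new = max(weight - MAX_WEIGHT_CHANGE,
                      min(weight + MAX_WEIGHT_CHANGE, new))
            changes.append((osd, weight, round(new, 4)))
            if len(changes) >= MAX_OSDS_CHANGED:
                break
        return changes

A dry run would just print the proposed list; with
--yes-i-really-really-mean-it it would actually apply it.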

One other reason we need to be able to disable --adjust-up is for
non-flat crush trees, where some OSDs get more PGs because of a
non-trivial ruleset. reweight-by-pg helps in this case, but it's still
not perfect for all crush layouts. For example, we have a ruleset which
puts two replicas in one root and a third replica in another root...
this is difficult for the reweight-by-* functions to get right. One
solution, which I haven't yet tried, would be to add a --bucket option
to reweight-by-* that only looks at OSDs under a given point in the
tree.
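Something along these lines is what I have in mind for the subtree
filter (again, purely an illustrative sketch with made-up data; the real
thing would walk the crush map rather than a toy dict):

    # Hypothetical --bucket filter: collect the OSDs under a given crush
    # bucket so reweight-by-* only considers that subtree.
    def osds_under(tree, bucket):
        out = []
        for child in tree.get(bucket, []):
            if child in tree:                 # nested bucket: recurse
                out.extend(osds_under(tree, child))
            else:                             # leaf: an osd id
                out.append(child)
        return out

    # e.g. tree = {'root-a': ['rack-1'], 'rack-1': ['osd.0', 'osd.1']}
    # candidates = set(osds_under(tree, 'root-a'))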

-- Dan



> Thanks, Dan!
> sage
>
>
>
>> I have already been running with these options in a python prototype:
>>   https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
>>
>> If you agree I'll port these changes to OSDMonitor.cc and send a PR.
>>
>> Best Regards,
>> Dan