On Feb 21, 2012, at 2:56 PM, Peter Zijlstra wrote:

> On Tue, 2012-02-21 at 14:38 +0200, Pantelis Antoniou wrote:
>>
>> If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
>> callbacks, we should place hooks into the thermal framework/PM as well.
>>
>> It will be pretty common to have per-core temperature readings on most
>> modern SoCs.
>>
>> It is quite conceivable to have a case with a multi-core CPU where, due
>> to load imbalance, one (or more) of the cores is running at full speed
>> while the rest are mostly idle. What you want to do, for best performance
>> and conceivably better power consumption, is not to throttle the
>> frequency or lower the voltage of the overloaded CPU, but to migrate the
>> load to one of the cooler CPUs.
>>
>> This affects CPU capacity immediately, i.e. you shouldn't schedule more
>> load on a CPU that is too hot, since you'll only end up triggering thermal
>> shutdown. The ideal solution would be to round-robin
>> the load from the hot CPU to the cooler ones, but not so fast that we lose
>> due to the migration of state from one CPU to the other.
>>
>> In a nutshell, the processing capacity of a core is not static, i.e. it
>> might degrade over time due to the increase of temperature caused by the
>> previous load.
>>
>> What do you think?
>
> This is called core-hopping, and yes that's a nice goal, although I
> would like to do that after we get the 'simple' bits up and running. I
> suspect it'll end up being slightly more complex than we'd like due
> to the fact that the goal conflicts with wanting to aggregate things on
> cpu0 due to cpu0 being special for a host of reasons.
>

Hi Peter,

Agreed. We need to get there step by step, and I think that per-task
load tracking is the first one. We do have other metrics besides load
that can influence the scheduler's decisions, the most obvious being
power consumption.

BTW, since we're going to the trouble of calculating per-task load with
increased accuracy, how about giving some thought to translating the
load numbers into an absolute format? I.e. with the CPUs now having
fluctuating performance (due to cpufreq etc.), one would say that each
CPU has a capacity of X bogomips (or some other absolute unit) per OPP.
Perhaps having such a bogomips number calculated per task would make
things easier.

Perhaps the same can be done with power/energy, i.e. have a per-task
power consumption figure that we can use for scheduling, according to
the available power budget per CPU.

Dunno, it might not be feasible ATM, but having a power-aware scheduler
would assume some kind of power measurement, no?

Regards

-- Pantelis
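
To make the "absolute load per OPP" idea concrete, here is a rough
userspace sketch in Python. It is not kernel code and not a proposed
interface: the OPP table, the capacity numbers and every name in it
(OPP_CAPACITY, Task, absolute_load, fits) are made up for illustration.
The only point is that a relative load measured at one OPP can be
rescaled into an absolute unit and then compared against the capacity
of another OPP.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical table: OPP frequency in MHz -> absolute capacity in
# invented "bogomips-like" units. Real values would have to be measured.
OPP_CAPACITY = {300: 600, 600: 1200, 1000: 2000}


@dataclass
class Task:
    name: str
    busy_fraction: float  # share of CPU time the task used, 0.0..1.0
    opp_mhz: int          # OPP the CPU was at while the load was sampled


def absolute_load(task: Task) -> float:
    """Rescale a relative load into absolute capacity units."""
    return task.busy_fraction * OPP_CAPACITY[task.opp_mhz]


def fits(tasks: Iterable[Task], target_opp_mhz: int) -> bool:
    """True if the summed absolute load fits within the target OPP."""
    total_load = sum(absolute_load(t) for t in tasks)
    return total_load <= OPP_CAPACITY[target_opp_mhz]


# Two tasks whose loads were sampled at different OPPs.
tasks = [Task("a", 0.50, 1000), Task("b", 0.25, 600)]
total = sum(absolute_load(t) for t in tasks)
```

With these invented numbers, the two tasks add up to 1300 units, so
they would not fit on a CPU held at the 600 MHz OPP (1200 units) but
would fit at 1000 MHz (2000 units). A per-task power figure could be
handled the same way against a per-CPU power budget, assuming such a
measurement exists.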