David Brownell wrote: > Interesting. I start to like this shape better; it moves more of the > logic to operating point code, where it can make the sysfs interface > talk in terms of meaningful abstractions, not cryptic numeric offsets. > But it was odd to see the first patch be platform-specific support, > rather than be a neutral framework into which platform-aware code plugs > different kinds of things... Since it is at a low layer below a number of possible interfaces, and since there is no generic processing performed at this low layer (it's pretty much set or get an opaque structure), there isn't any higher-layer framework to plug into at the moment. If something like these abstractions of power parameters and operating points are felt to be a good foundation for a runtime power management stack then turning our attentions to the next layer up (perhaps cpufreq or a new embedded-oriented stack) would create that generic structure. Its worth noting that newer embedded SOCs are coming up with such complicated clocking structures and rules for setting and switching operating points that some silicon vendors are starting to provide code at approximately the PowerOP level for their platforms, to plug into different upper-layer power management stacks (and possibly different open source OSes). So there may be some value to settling on common interfaces for this. > One part I don't like is that the platform would be limited to tweaking > a predefined set of fields in registers. That seems insufficient for > subsystems that may not be present on all boards. Yes, the code currently assumes it would be tweaked for different variants of platforms, partly due to the difficulty of implementing a lean and mean way of integrating the different pieces. It sounds like registering multiple handlers for multiple sets of power parameters may be in order, although a single opaque structure shared between upper layers and the handlers probably won't be sufficient any more. If the operating point data structure basically goes away and sysfs becomes the preferred interface then it should be fairly straightforward to discover what PM capabilities are registered and to get/set the associated power param attributes. Otherwise in-kernel interfaces might need some further thought to specify something that routes to the proper handler. > Plus, to borrow some > terms from cpufreq, it only facilitates "usermode" governor models, never > "ondemand" or any other efficient quick-response adaptive algorithms. The sysfs interface does not itself handle such schemes, but the PowerOP layer is fine with inserting beneath in-kernel algorithms. Low-latency, very frequent adjustments to power parameters are very much in mind for what I'm trying to do, assuming embedded hardware will increasingly be able to take advantage of aggressive runtime power management for battery savings. (Much of this is driven by how embedded hardware can most aggressively but usefully be power managed, and it would be nice to get those folks more involved.) What DPM does with approximately the same type of interface is setup some operating points and policies for which operating point is appropriate in which situations, and then kick off a kernel state machine that handles the transitions. ... > Alternatively, the "thing" could implement some adaptive algorithm > using local measurements, predictions, and feedback to adjust any > platform power parameters dynamically. Maybe it'd delegate management > of the ARM clock to "cpufreq", and focus on managing power for other > board components that might never get really reusable code. Switching > between operating points wouldn't require userspace instruction; > call it a "dynamic operating point" selection model. Interesting, although such close coordination of changing various clocks and voltages is required on some platforms that it would be hard to distribute it much among kernel components. To some degree the above is how DPM functions: some policy instructions are sent to the kernel and the kernel switches operating points accordingly. Something more flexible than operating points could be specified in the policy info, possibly even something as abstract as "battery low", pushing the interpretation of high-level power policy into kernel components instead of a userspace app giving the kernel low-level instructions. > The DSP clock might benefit from some support though. I've never > much looked at this, beyond noting that SPUs on CELL should have > similar issues. Wouldn't it be nice to have "ondemand" style > governors for DSPs or SPUs? That's got to be easy. ;) So far as I understand, Linux-coordinated power management of the DSP side of dual-core general-purpose + DSP platforms is often handled by a Linux driver that knows how to talk to whatever it is that runs on the DSP (such as via shared memory message libs from the silicon vendor). Soon the other core will be running Linux as well, and the two OSes will need to coordinate the system power management, which will be an interesting thing to tackle. >> lowpwr 1 = assert ULPD LOW_PWR, voltage scale low > > > Could you describe the policy effect of this bit? I suspect > a good "PCs don't work like that!" example is lurking here. > That interacts with some other bits, and code ... when would > setting this be good, or bad? This is how Dynamic Voltage Scaling is done on OMAP1 platforms. Assuming you've setup an operating point that is validated to work at the reduced voltage level on your hardware by TI (these are two voltage levels available), you can optionally specify to run at reduced voltage, possibly at an increased cost in latency of transitioning between operating points as voltage ramps up or down. In the case of DPM running on an OMAP 1610 H2, you could tell the system to run at 1.5V when not idle and at 1.1V when idle, although depending on the ramp time (I can't recall for that board, but for some non-OMAPs this can be significant) and the realtime constraints of your app there could be missed deadlines under such a policy. If the system isn't running anything with a tight deadline then it may be fine to stay at 1.1V or voltage scale between the two. >>Other parameters such as DSPMMUDIV, LCDDIV, and ARM_PERDIV might also be >>useful. > > > Again, PERDIV changes would need to involve clock.c to cascade > the changes through the clock tree. Change PERDIV and you'll > need to recalculate the peripheral clocks that derive from it... > better not do it while an I/O operation is actively using it! On some other platforms this actually becomes necessary, but for OMAP1's the trouble with doing so probably precludes anybody from using it. > As with TCDIV, that makes a useful example of something that is > clearly not within the "cpufreq" domain. I'll try to cook up an XScale PXA27x example, which adds multiple memory and system bus frequencies supported per CPU MHz, quick run vs.turbo mode switching of CPU MHz, and some other exotic features. It has a very specific set of "product points" validated by Intel that correspond to the operating point abstraction. If nothing else, it may be instructive to consider the variety of ways embedded platforms are being designed to be power managed. -- Todd