Now that 2.6.11 is out, we can start to address a number of important issues for power management. This overview discusses some of them, the things I have run across in my own work.

Up to now, the PM development effort has been concerned primarily with system-wide sleep transitions, things like Suspend-To-RAM (STR) and Suspend-To-Disk (STD). (A more general, less PC-centric description would call these states "deep sleep" and "shallow sleep". A third possible state, which some people might be in favor of, is Standby or "very shallow sleep".) The important thing here is that these involve global transitions, affecting every device in the system. Also, they don't involve many policy decisions (which clocks to turn off and so on), and in any case such decisions are outside the scope of the PM core.

The implementation of sleep transitions has gone reasonably smoothly, and people generally understand the problems and what still needs to be done. There's one thing I'd like to mention: the refrigerator. Ben H. has said that putting processes in the refrigerator at the start of a sleep transition is an unnecessary waste of time and I tend to agree, with a couple of provisos:

In SMP systems, things would quickly get very confusing if more than one processor was actively running during a sleep transition. I don't know how this is handled currently, but presumably it's not a major problem.

In STD, it's important that most processes do not run at all once the memory image has been captured. In particular they must not run while the image is being stored on the disk. But sometimes a few processes have to be allowed to run (e.g., those needed for performing disk I/O). The scheduler must be told that only processes marked PF_NOFREEZE can be allowed to run.

If those two things are handled correctly, I don't see any need for the refrigerator. Maybe someone can point out reasons that I haven't thought of.
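As a rough illustration of the second proviso, here is a userspace toy model of the check the scheduler would need. It is only a sketch, not scheduler code: struct toy_task, image_captured and the toy_* functions are all made up for this example, and the numeric value given to PF_NOFREEZE is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value only; the kernel has its own definition. */
#define PF_NOFREEZE 0x00008000u

/* Hypothetical stand-in for the parts of a task the check needs. */
struct toy_task {
	const char *name;
	unsigned int flags;
};

/* Hypothetical global: set once the STD memory image has been captured. */
static bool image_captured;

/* Would the scheduler be allowed to pick this task right now? */
static bool toy_task_may_run(const struct toy_task *t)
{
	assert(t != NULL);
	if (!image_captured)
		return true;	/* normal operation: no restriction */
	/* After the snapshot, only tasks marked PF_NOFREEZE may run. */
	return (t->flags & PF_NOFREEZE) != 0;
}

/* Count how many tasks in the array may run under the current rule. */
static size_t toy_count_runnable(const struct toy_task *tasks, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (toy_task_may_run(&tasks[i]))
			count++;
	return count;
}
```

The point of the sketch is that the restriction is a test made at the moment a task is selected to run, rather than a separate pass that has to visit and freeze every process beforehand.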
Now it's time to consider how to implement additional power-saving measures -- in other words, selective suspends. Support currently is minimal and there are several important matters to settle.

Typically selective suspend comes in two forms: drivers automatically reducing power usage by a device after a period of inactivity (let's call this "auto suspend"), and userspace-initiated changes carried out by writing to /sys/devices/.../power/state ("user suspend"). They are rather different from system-wide sleep transitions, and each has its own set of problems.

A common problem for all selective suspends is that, unlike system sleeps, they can occur at any time. Drivers will get very confused unless we can guarantee somehow that, at a minimum, they will not receive a suspend or resume call for a device while its probe or release routine is running. Some general sort of mutual exclusion is needed, something more than just dpm_sem. (Individual subsystems may have even stronger requirements for mutual exclusion. For example, the USB drivers require that changing device configurations and performing port resets are mutually exclusive with each other and with both suspend/resume and probe/release.) I rather think that the need for this will be so widespread that it deserves to be integrated into the driver model core.

An important difference between system sleep and selective suspend is that with selective suspend, we generally expect the device to resume on demand. This demand may take the form of a request to the driver (e.g., a block I/O request for a disk device) or a resume request from the device itself (e.g., a notification from a mouse that has just been moved). This means that input queues must not be plugged and device interrupts must remain enabled, exactly the opposite of what happens during system sleep. For this reason it is vital for drivers to know whether a suspend call is invoking a system sleep or a selective suspend.
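To make the difference concrete, here is a small userspace model of a driver that handles the two cases differently. It is a sketch under stated assumptions, not driver code: the enum, struct toy_device and the toy_* functions are invented for illustration and do not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: the two kinds of suspend a driver must tell apart. */
enum toy_suspend_kind {
	TOY_SUSPEND_SYSTEM,	/* system-wide sleep transition */
	TOY_SUSPEND_SELECTIVE,	/* auto suspend or user suspend */
};

/* Hypothetical per-device state, just enough to show the difference. */
struct toy_device {
	bool suspended;
	bool queue_plugged;	/* are incoming requests held back? */
	bool irq_enabled;	/* can the device still signal the driver? */
};

static void toy_suspend(struct toy_device *dev, enum toy_suspend_kind kind)
{
	assert(dev != NULL);
	dev->suspended = true;
	if (kind == TOY_SUSPEND_SYSTEM) {
		/* Nothing may happen until the whole system wakes up. */
		dev->queue_plugged = true;
		dev->irq_enabled = false;
	} else {
		/* Resume on demand: requests and wakeups must get through. */
		dev->queue_plugged = false;
		dev->irq_enabled = true;
	}
}

static void toy_resume(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->suspended = false;
	dev->queue_plugged = false;
	dev->irq_enabled = true;
}

/* A request arrives from above. Returns true if it was accepted. */
static bool toy_submit_request(struct toy_device *dev)
{
	assert(dev != NULL);
	if (dev->queue_plugged)
		return false;		/* held back during system sleep */
	if (dev->suspended)
		toy_resume(dev);	/* selective suspend: wake on demand */
	return true;
}
```

In this model a request submitted after a selective suspend wakes the device and goes through, while the same request after a system suspend is refused until toy_resume() has been called. The suspend routine can only make that choice if its caller tells it which kind of suspend is in progress.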
Hence I propose that a new pm_message_t event code, PMSG_SELECTIVE (or maybe PMSG_SELECTIVE_SUSPEND), be used for selective suspends.

With resume-on-demand implemented properly, a driver may decide that it can suspend its device without bothering to suspend the device's children. This kind of decision should be left to individual drivers and the PM core shouldn't try to enforce a "children must be suspended before their parents" policy for selective suspends.

A common problem facing all drivers that do auto suspend is how to set the inactivity timeout. Two possible answers are: add an attribute file in the /sys/.../power directory (so different devices can have different timeouts), or add a driver module parameter (so all devices using the same driver will have the same timeout). The module parameter approach is more efficient, but it suffers from the drawback that a driver is not notified when a parameter is changed! So how should we handle the situation where a user decreases the timeout value? The case where the timeout value is _increased_ poses no problem; when the original timer expires the driver can figure out that it's not yet time to do the suspend.

For user suspends (made through sysfs) the user may want to convey arbitrary information to a driver, things like which clocks to turn off, which power level to change to, and so on. This information will vary from driver to driver, and the PM core shouldn't even try to impose any sort of structure on it. I think the best approach will be to pass to the driver a character pointer giving the data written to /sys/.../power/state, so that users can send whatever they want just by writing it to the file. This means adding an additional field to pm_message_t.

Alan Stern