On Friday 29 October 2004 10:26, Randy.Dunlap wrote:
>
> > Just read the paper David did summarizing our discussions, I think it's
> > pretty clear the kind of callbacks we need.
>
> David, can you post that paper/summary, please?

Here's the current version. Subject to change etc.

Updates since the last version include addressing comments from
Pavel, Benjamin, and me; especially trying to articulate the
distinctions between "System PM" (sleep states) and "Device PM".
(I'm not sure what Benjamin means by "Dynamic PM", but suspect
it has to do with "Device PM".) Plus some updates based on
email over the last couple days.

Again, the idea is to agree on concepts and models first, then
work down to what that means in terms of swsusp on x86, STR using
ACPI, STR on Apple/PPC hardware, maximizing battery life on Linux
cell phones, and so on.

- Dave

-------------- next part --------------
[[ This is a DRAFT trying to capture the big picture into which ]]
[[ upcoming PM API changes -- being discussed! -- will fit. ]]
[[ "A consistent set of terminology, and truisms"... ]]

[[ Brackets like this are notes/comments ]]


linux/Documentation/power/model.txt    ** DRAFT **    Fri Oct 29 2004

Lots of hardware and software interact to make power management work
on modern systems. This document starts to sort through all that, and
show how PM-aware drivers should interact with core PM code in Linux
2.6 kernels.

This document will mention ACPI power states. Linux PM doesn't depend
on ACPI or any other firmware framework. ACPI is mentioned both because
it provides some important examples, and because it defines (in
chapter 2) a model that's easy to map onto platforms without ACPI
bytecodes.


1. MODELS
=========

Linux has two different Power Management models that affect device
drivers. Implementations (such as ACPI, APM, and platform-specific
code) fit broadly into one or both of these models:

  + "System PM" manages sleep state transitions.
    In ACPI terms this means switching from "G0" operational states to
    lower power "G1" sleep states when the system is not in active use.
    Linux sysfs access to these sleep states uses /sys/power/state.
    Examples of sleep states include "Suspend to RAM" and "Suspend to
    Disk". Many systems also support part of the range of "Standby"
    states. The main CPU is in a low power state, and probably other
    components are too.

  + "Device PM" applies to individual devices. In ACPI terms these
    apply only within G0 operational states, and are used to reduce
    runtime power usage by choosing device low power modes when
    possible. (Using ACPI states like D0/D1/D2/D3.) The main CPU
    isn't idle. Sometimes busses define low power device modes, like
    PCI D3hot and USB suspend. On some platforms, devices interact
    directly with the clocks and voltages that cpufreq uses, giving
    additional power saving modes that may not fit neatly into x86 PC
    or ACPI models.

These interact, since some of the components that sleep states put into
low power modes are devices ... which may already be in low power modes!

There's also a third model that applies to CPUs; "cpufreq" is analogous
to "Device PM". (Just think of CPUs as very special devices.) Some
"Standby" states happen automatically when an idle CPU halts itself with
many devices already in low power modes, maybe in conjunction with
cpufreq. Aggressively using such mechanisms helps batteries last
longer, and can help keep server room temperatures low.

Suspend-to-Disk (STD) has gotten much of the recent attention, in part
because many developers want at least their laptops to get enough sleep.
At least some parts of Device PM are partially handled outside the
kernel: "hdparm" can spin down idle disks, X11 can power down monitors,
"iwconfig" manages 802.11 parameters, and so on.

    [[ The 2.6 driver model core should support Device PM. ]]
    [[ BUT as of October 2004, it's generally acknowledged that the ]]
    [[ Linux support for Device PM is weak. ]]


1.1. System PM
--------------

System PM is what makes a Linux system go to sleep: a policy decision
gets made to enter some platform-specific sleep state, then it happens.
There are a few conventional sleep states, entered using sysfs:

    echo standby > /sys/power/state
        Standby; ACPI calls this S1. There are a wide variety of
        standby states, since it's so flexible: any number of devices
        may be in low power modes and the CPU caches aren't changed.

    echo mem > /sys/power/state
        Suspend-to-RAM ("STR"); ACPI calls this S3. Most non-wakeup
        devices are in low power modes, the CPU is "off" enough that
        resume starts with a reset (and cache reload), but DRAM is
        still refreshed so wakeup will often be relatively quick
        (especially on systems that run entirely from flash and RAM).

    echo disk > /sys/power/state
        Suspend-to-Disk ("STD"); ACPI has two variants of this, both
        called S4 (in Linux terms "swsusp" and "S4bios"). Most
        non-wakeup devices are unpowered, but not necessarily all of
        them. In particular, DRAM is not refreshed; resuming (after
        wakeup by some device) means reloading from a system snapshot
        and re-initializing most hardware.

    [[ NOTE: there are actually at least four different states ]]
    [[ for "disk". "swsusp" if /sys/power/disk is "shutdown", ]]
    [[ using ACPI S5. other values: "firmware" and "platform" ]]
    [[ (one of them means "s4bios"), and "reboot". probably only ]]
    [[ one of them should be "disk"... ]]

Systems don't need to support all of those, and can support other sleep
states as defined by the platform. For example, some platforms have a
"Deep Sleep" that keeps main DRAM refreshed and acts like ACPI S3.
Others use that name for a state like S4 (but without a disk; not
"STD"!), where only a few pages of SRAM and the wakeup devices are kept
powered.

    [[ Not all systems that list "standby" in /sys/power/state can ]]
    [[ enter a standby state; likewise with "mem". This would probably ]]
    [[ be fixed by having better platform-specific initialization; ]]
    [[ that could sort out the "what does 'disk' mean" issue too. ]]

For all of those sleep states, drivers need to support an internal
"freeze" state for their devices, where the device is idle with I/O
disabled (including DMA and IRQs). From the hardware perspective, this
resembles the state a device will be in after device_release_driver(),
though it should usually not be a low power mode. The difference to
software is that the device driver still manages I/O requests (maybe by
just blocking a queue), wakeup events, and device-specific low power
modes. In many cases, the device must later enter a low power mode
before the system can sleep.

    [[ There's the comment that kexec needs "freeze" states too... ]]

Different boards may treat similar devices very differently. For
example during STR a given type of controller might stay powered on one
system, but be powered off on a different one. One such controller
might be able to issue wakeup events, but not another.


1.2. Device PM
--------------

Device PM is what makes a device enter a low power mode. The final
choice of what state to enter is made by the device driver, but sysfs
supports selective device suspend requests like
"echo D3hot > .../power/state".

    [[ We might prefer to get rid of those files, treating these ]]
    [[ power modes as driver-internal. We clearly can't continue ]]
    [[ having power/state be a u32; its semantics are NOT global!! ]]

The possible states are device-specific:

  + Some states may be standardized by the bus interface. PCI devices
    may implement D1/D2/D3hot/D3cold low power modes. USB devices
    support only a single low power "suspend" state.

  + Device power mode is orthogonal to the "freeze" driver state.
    Low power modes are not necessarily frozen, and vice versa.
    Sometimes hardware specific optimizations are available, like
    PCI drivers knowing that all their low power modes are also frozen.
  + The device's function may define several power modes. Some of
    these might be managed through user space software. (Consider
    a device that manages settings on a power switch.)

  + A device might have several clocks and power rails with individual
    controls, allowing several distinct power modes.

  + Devices often share clocks with others on the same chip. Some
    clocks can be turned off when all of the devices are inactive,
    saving more power. (ARM Linux has clock APIs to help with these
    common cases.)

    [[ Those clock gating issues aren't specific to ARM Linux... ]]

  + Devices generally can't enter low power modes before their
    children, or leave them before their parents. At least some of
    that logic might need to live in the bridge or hub drivers
    managing those relationships.

    [[ driver core doesn't ensure that sequencing through sysfs ... ]]

  + Devices may need to configure hardware wakeup signals before they
    enter low power modes (rather than after). USB requires that;
    PCI doesn't.

As with cpufreq, a driver might support power management policies, used
on its devices while the system is running (ACPI G0). Today these are
always driver-internal; there's no current per-device sysfs support for
examining or changing these policies.

  + A "performance" mode that's not power-aware. This is all that
    most current drivers support: always-on.

  + Some drivers might want to support "userspace" management. This
    is often done as part of a userspace API to the device, such as
    DPMS screen blanking.

  + An "ondemand" mode using minimal power except when the device is
    in active use. This could be invisible to other software, except
    that the device may sometimes take a bit more time to respond to
    requests. The driver's request queues are still active.

    Examples, each saving a few Watts of power: A disk might spin
    down. A USB mouse might suspend itself until woken up,
    eliminating a source of periodic DMAs that prevents some x86 CPUs
    from entering C3 state.
  + Almost all drivers should be able to support an "off" policy.
    They may already support some variant of it for hot-unplug cases,
    in the period between the unplug and when the bus driver finishes
    unbinding that driver from the device. (Cardbus devices return ~0
    for reads, USB requests get timeouts; and so on.)

When the system is going to sleep (ACPI G1), drivers are called from PM
core code to choose power modes compatible with the upcoming sleep
state. Those requests "trump" power modes previously chosen by Device
PM policy (for ACPI G0) while the system is asleep, and can sometimes
use bus-specific defaults to choose the new device state.


2. OVERVIEW OF ENTERING SYSTEM SLEEP STATES
===========================================

This text is a bit biased towards STD cases ... but we really need it
to work for STR and Standby transitions (and others!) too.

  + Something initiates a transition, possibly from userspace. Or a
    kernel timer might fire after being in Standby for a long time,
    waking the system just long enough to initiate Suspend-to-RAM.

  + A "freeze yourself" message is broadcast to many components.

    This involves freezing tasks and drivers. Device trees must
    freeze their drivers bottom-up. It may be important to receive
    acknowledgements from userspace software (like an X server); some
    of the applications may need to veto sleep requests. There may
    be additional sequencing requirements.

    [[ NOTE vetoing was part of APM, not consistently handled. ]]
    [[ The current framework doesn't seem to want vetoes ... ]]

    When this finishes, only one Linux CPU should be active, and
    filesystems should normally have flushed all dirty pages.

  + ONLY FOR STD -- a snapshot of the RAM image must be saved to
    persistent storage. This means that

      * Currently, 50% of memory must be free to create the snapshot.

      * At least one (swap) device gets thawed, along with its
        ancestor devices, top-down.
      * System timers probably need to be working, also DMA and the
        relevant IRQ controllers.

      * The device performs enough I/O to make sure the image is
        saved persistently. (Example: it must be flushed from disk
        caches.)

      * Those devices presumably need to be re-frozen.

  + Drivers get told to enter power modes appropriate to the system
    sleep state. Some busses could provide default mappings for
    those, into a safe device power mode.

    This MUST be done in a way that respects the physical topology of
    the hardware ... one of the original tasks of the 2.6 driver model
    work. You can only power down a PCI device before you power down
    its parent bridge. You can only suspend a USB device before
    suspending its hub. (Violating these rules would mean you can't
    do I/O to the device as part of suspending it.)

    As part of this, drivers will tell many devices to enable wakeup
    (if they've not done so already for an "ondemand" Device PM
    policy).

  + Then the system itself enters the appropriate sleep state, using
    some platform-specific mechanism. There are two basic routes:

      * Linux manages everything, given very detailed platform
        hardware specifications (which might not be available, or
        stable).

      * Firmware ("BIOS" on PCs) handles at least part of it. In
        some cases this is good, like re-initializing graphics
        accelerators or abstracting board-specific details. But
        firmware bugs are widespread; any given sleep state won't
        work on all boards.

    In most cases, Linux relies on firmware, either to enter the
    sleep state or (in one STD case) just to power down, and later
    power up.

  + System leaves the sleep state when some wakeup device asks it to
    do so; examples include power buttons, keyboards, mice, LAN,
    clock, etc.

  + System resumes. The two basic paths here match how it entered
    the sleep state: firmware may or may not be involved. Typically
    it's involved.

    Device drivers get notified about system resume. If their
    hardware was reset, the driver must reinitialize. The driver's
    software state will sometimes have been restored from an STD
    snapshot, reflecting "freeze" and not matching the current
    hardware state; but for STR and Standby states, the driver's
    state can always reflect having entered the low power mode it's
    resuming from.

    [[ OK, this is still shy about most resume details. ]]
    [[ Some of them seem like they'll matter. ]]
    [[ Should "resume" differ from "thaw"? ]]


3. SYSTEM VS. DEVICE: PM INTERACTIONS
======================================

  + When Device PM has put a device into a low power mode, it might be
    able to freeze, or enter a system sleep state, with no more work.
    Likewise, resuming from that system sleep state might not need to
    affect that device: it might stay in that low power mode.

  + Some devices don't support low power modes, which need not be a
    problem for at least "standby" states.

  + Some devices support "wakeup" capabilities that let them
    transition the system from a sleep state (such as STR/S3 or even
    STD/S4bios) to operational state. Examples include network
    adapters with "Wake On Lan" (WOL) ability, keyboards and mice
    (both PS2 and USB), and certain timers.

  + Transitioning TO a system sleep state involves letting devices
    change to a compatible power mode. For example, a PCI device with
    PCI PM capabilities can always enter PCI D3hot. Drivers must
    choose though:

      * PCI-based USB host controllers will be fine with D3hot in all
        cases, although D1 and D2 can be used in "standby" states.

      * The driver for a 3D graphics card might not support using
        anything except D2 unless certain resume paths are followed
        (re-initializing or reloading firmware, also needed for
        various network adapters).

      * Certain PCI slots might be powered off in certain states,
        while others might keep power. (Board-specific.)

      * When power to a device is going to be turned off, there may
        not be much to do except early cleanup (reducing the amount
        of data that STD must save).
    So there need to be various platform-specific hooks, possibly
    exposing data to drivers using dev->platform_data or bus-specific
    calls like pci_get_pm_details(pdev).

  + Transitioning FROM a system sleep state involves some of the same
    issues on the other end.

      * Devices won't always resume into a full power mode; and full
        power may be undesirable too.

      * Hotpluggable devices are commonly disconnected during sleep,
        just like they're commonly disconnected during operation.

      * Selective re-activation is used when the hardware didn't
        discard state. For example, USB devices in suspend state
        won't change configuration or discard endpoint state, so
        there'd be no point in any re-initialization.

      * Sometimes returning to operational states involves complete
        re-initialization of the hardware. This is typically the
        case when the device lost power, and was thus reset.

      * For sleep states that completely power down the system (like
        the default "swsusp" STD state, but not necessarily the
        S4bios STD), some controllers need to handle more than just
        the "device was reset" case. Boot firmware sometimes leaves
        controllers (such as for USB keyboard support) in surprising
        states that have no resemblance to a true hardware resume.


4. PATCHES, PATCHES, PATCHES
============================

Linux needs various patches to move closer to this model, including:

    [[ NOT ALL OF THESE GOT DISCUSSED IN DETAIL!! ]]
    [[ Many of the API sketches here are just "doodles" ]]

  + The driver/pm core needs to support wakeup events.

    --> Draft patches have been posted. (David Brownell)

        EXCEPT note that ACPI wakeup-enable isn't enabled by
        in-kernel APIs, so for example pci_enable_wake() can't make
        sure that the device is enabled (even if you could figure
        out which PCI device corresponded to which ACPI device) ...
        and there's no way for the i8042/serio drivers to
        enable/disable wakeup support on suspend().

  + The PM core code needs to stop self-deadlocking when suspend() or
    resume() operations need to remove devices.
    --> Draft patches have been posted. (Paul Mackerras)

  + Sometimes devices may need driver/pm core calls to suspend the
    whole tree rooted at that device. Or do we want to make that work
    be the responsibility of those bridge or hub drivers?

  + Driver suspend() calls need to accept system-specific sleep
    states as parameters:

      * These should be typed well enough to give compiler errors
        (not "sparse" ones!) for drivers that expect "u32" etc ... at
        least if we continue to call these suspend() and resume().
        (Platform devices would need to change, but otherwise only
        pmcore code really makes those driver model calls.)

        --> Still under discussion, quite possibly /sys/power could
            be built from a vector of these:

                struct system_sleep_state {
                        char    *name;
                        void    *platform_data;
                };

                /* example system states might be:
                 *      ACPI_S1         "standby",
                 *      ACPI_S3         "mem",
                 *      ACPI_S4bios     "disk",
                 *      ACPI_S5         "disk",  (typical swsusp)
                 *      ACPI_S5         "reboot",
                 *      ACPI_S5         "powerdown",
                 *      ACPI_S5         "halt",
                 */

                dev->driver->suspend(dev, struct system_sleep_state *)
                dev->driver->resume();

        --> There's some sentiment that resume() might need some kind
            of parameter. In which case we might not need a separate
            suspend() method; the parameter might indicate "resume
            from S4bios STD" or "resume from S5 STD"

        --> There's also the notion that the parameter to suspend()
            continue to be some sort of enum or bitmask, perhaps a
            message something like

                struct device_pm_foo {
                        enum { FREEZE, SUSPEND, RESUME } verb;
                        int tbd_flags;  /* std phase, "gonna STR", etc */
                };

      * Drivers need to be able to choose the actual power
        transitions they'll use. Most drivers won't want to care, but
        some will need access to board-specific information like:

          - Will this device be powered down in the target sleep
            state? This will always be true for certain STD variants.

          - Will certain clocks be removed? For example, some sleep
            states might preclude leaving a 48MHz clock active.

          - Will its firmware need to be reloaded? By Linux?
          - Will system memory be powered down?

        Enabling wakeup can also be platform/board/chip-specific; PCI
        is probably the exception in having a standard way to enable
        it (update bits in PM caps with pci_enable_wake).

        --> Still under discussion. That data could come through
            platform_data in the device or system_sleep_state, or
            through "helper" routines like pci_get_pm_details(). Some
            state might be available only indirectly through those
            mechanisms.

                pm_is_device_powered(struct system_sleep_state *state,
                                struct device *dev);
                pm_is_clocked(struct system_sleep_state *state,
                                char *clock_name);
                pm_is_snapshotted(struct system_sleep_state *state);

      * PCI drivers are a particular problem, since there are so many
        of them (not all in-kernel) using the original LK 2.4 PCI API
        which specifies power modes (D1/D2/D3/...) to drivers rather
        than expecting a policy-based choice.

        --> Still under discussion; it'd be nice to avoid needing to
            change every PCI driver in Linux.

            One possible approach for PCI drivers continues
            supporting PCI calls (with typechecking updates!) but
            adding a new API allowing drivers to provide their own
            policies. The PCI bus driver model code would provide a
            default implementation of that new call that maps system
            sleep states into PCI power modes, then calls the old PCI
            API.

  + We need a way to freeze then thaw a device, and maybe ways to see
    if a device is currently frozen.

    --> One option: set_freeze_state(dev, value). Another: make this
        a system_sleep_state, but one that's not accessible through
        /sys/power/state.

  + The driver/pm core needs to get rid of dev->power.power_state as a
    globally meaningful notion of device mode. There is no such
    notion that's meaningful for all devices; even "on" has
    variations!

    This is primarily useful for "echo -n 3 > .../power/state"
    support, and that mechanism may need to exist -- but any values
    should identify device-specific power modes.

    --> Nothing yet drafted.
        Not clear that we want sysfs writes at all, much less
        supporting anything beyond some generic modes. Maybe reading
        it should just return a driver-specified string, and writing
        should be disallowed.

    --> Complete removal of dev->power.power_state has plenty of
        fans. But I do like the notion of sysfs reporting that a
        device is in "PCI_D2" or "USB_SUSPEND".

  + For PCI devices, pdev->current_state should probably be read from
    the device config space. And the offset of the pm capability
    should probably be cached, eliminating lots of repetitive lookups.

[[ Worth doing: compare with other operating systems: Darwin, Symbian etc. ]]
[[ Also worth doing: start ripping out more "legacy" Linux PM code, there ]]
[[ are several models and they don't all play well together ... ]]


ACKNOWLEDGEMENTS
================

Many people have contributed to Linux power management infrastructure
over the years. This particular document incorporates ideas that came
from discussions including notably:

    David Brownell
    Nigel Cunningham
    Benjamin Herrenschmidt
    Pavel Machek
    Paul Mackerras
    Patrick Mochel
    Todd Paynor
    Alan Stern