[linux-pm] PM models

david-b at pacbell.net (David Brownell) · Fri Dec 10 13:26:46 2004

On Friday 29 October 2004 10:26, Randy.Dunlap wrote:
> 
> > Just read the paper David did summarizing our discussions, I think it's
> > pretty clear the kind of callbacks we need.
> 
> David, can you post that paper/summary, please?

Here's the current version.  Subject to change etc.

Updates since the last version include addressing comments
from Pavel, Benjamin, and me; especially trying to articulate
the distinctions between "System PM" (sleep states) and
"Device PM".  (I'm not sure what Benjamin means by "Dynamic PM",
but suspect it has to do with "Device PM".)  Plus some updates
based on email over the last couple days.

Again, the idea is to agree on concepts and models first,
then work down to what that means in terms of swsusp on x86,
STR using ACPI, STR on Apple/PPC hardware, maximizing battery
life on Linux cell phones, and so on.

- Dave

-------------- next part --------------
    [[ This is a DRAFT trying to capture the big picture into which	]]
    [[ upcoming PM API changes -- being discussed! -- will fit.		]]
    [[ "A consistent set of terminology, and truisms"...		]]

    [[ Brackets like this are notes/comments				]]

linux/Documentation/power/model.txt

** DRAFT **
Fri Oct 29 2004

Lots of hardware and software interact to make power management work on
modern systems.  This document starts to sort through all that, and shows how
PM-aware drivers should interact with core PM code in Linux 2.6 kernels.

This document will mention ACPI power states.  Linux PM doesn't depend on
ACPI or any other firmware framework.  ACPI is mentioned both because it
provides some important examples, and because it defines (in chapter 2)
a model that's easy to map onto platforms without ACPI bytecodes.

1. MODELS
=========
Linux has two different Power Management models that affect device drivers.
Implementations (such as ACPI, APM, and platform-specific code) fit broadly
into one or both of these models:

  + "System PM" manages sleep state transitions.  In ACPI terms this means
    switching from "G0" operational states to lower power "G1" sleep states
    when the system is not in active use.  Linux sysfs access to these sleep
    states uses /sys/power/state.

    Examples of sleep states include "Suspend to RAM" and "Suspend to Disk".
    Many systems also support part of the range of "Standby" states.  The
    main CPU is in a low power state, and probably other components are too.

  + "Device PM" applies to individual devices.  In ACPI terms these apply
    only within G0 operational states, and are used to reduce runtime power
    usage by choosing device low power modes when possible.  (Using ACPI
    states like D0/D1/D2/D3.)  The main CPU isn't idle.

    Sometimes busses define low power device modes, like PCI D3hot
    and USB suspend.  On some platforms, devices interact directly with
    the clocks and voltages that cpufreq uses, giving additional power
    saving modes that may not fit neatly into x86 PC or ACPI models.

These interact, since some of the components that sleep states put into
low power modes are devices ... which may already be in low power modes!
There's also a third model that applies to CPUs; "cpufreq" is analogous
to "Device PM".  (Just think of CPUs as very special devices.)

Some "Standby" states happen automatically when an idle CPU halts itself
with many devices already in low power modes, maybe in conjunction with
cpufreq.  Aggressively using such mechanisms helps batteries last longer,
and can help keep server room temperatures low.

Suspend-to-Disk (STD) has gotten much of the recent attention, in part
because many developers want at least their laptops to get enough sleep.

At least some parts of Device PM are partially handled outside the kernel:
"hdparm" can spin down idle disks, X11 can power down monitors, "iwconfig"
manages 802.11 parameters, and so on.

    [[ The 2.6 driver model core should support Device PM.		]]
    [[ BUT as of October 2004, it's generally acknowledged that the	]]
    [[ Linux support for Device PM is weak.				]]

1.1. System PM
--------------
System PM is what makes a Linux system go to sleep: a policy decision gets
made to enter some platform-specific sleep state, then it happens.  There are
a few conventional sleep states, entered using sysfs:

    echo standby > /sys/power/state

      Standby; ACPI calls this S1.  There are a wide variety of standby
      states, since it's so flexible:  any number of devices may be in low
      power modes and the CPU caches aren't changed.

    echo mem > /sys/power/state

      Suspend-to-RAM ("STR"); ACPI calls this S3.  Most non-wakeup devices
      are in low power modes, the CPU is "off" enough that resume starts
      with a reset (and cache reload), but DRAM is still refreshed so
      wakeup will often be relatively quick (especially on systems that
      run entirely from flash and RAM).

    echo disk > /sys/power/state

      Suspend-to-Disk ("STD"); ACPI has two variants of this, both called S4
      (in Linux terms "swsusp" and "S4bios").  Most non-wakeup devices are
      unpowered, but not necessarily all of them.  In particular, DRAM is not
      refreshed; resuming (after wakeup by some device) means reloading from
      a system snapshot and re-initializing most hardware.

	[[ NOTE:  there are actually at least four different states	]]
	[[ for "disk".  "swsusp" if /sys/power/disk is "shutdown",	]]
	[[ using ACPI S5.  other values:  "firmware" and "platform" 	]]
	[[ (one of them means "s4bios"), and "reboot".  probably only	]]
	[[ one of them should be "disk"...				]]
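
As a rough illustration of the keyword list above, the pairing between
/sys/power/state keywords and the ACPI names used in this document can be
written as a small lookup table.  The names "sleep_keyword" and
acpi_name_for() are made up for this sketch; they are not kernel APIs, and
the table ignores the "disk" ambiguity described in the note above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: sysfs keyword -> ACPI name, as described in this text. */
struct sleep_keyword {
    const char *keyword;    /* what gets written to /sys/power/state */
    const char *acpi_name;  /* ACPI sleep state this document pairs it with */
};

static const struct sleep_keyword sleep_keywords[] = {
    { "standby", "S1" },    /* Standby */
    { "mem",     "S3" },    /* Suspend-to-RAM */
    { "disk",    "S4" },    /* Suspend-to-Disk; may really end in S5 */
};

/* Return the ACPI name for a keyword, or NULL if the keyword isn't listed. */
const char *acpi_name_for(const char *keyword)
{
    size_t i;

    assert(keyword != NULL);
    for (i = 0; i < sizeof(sleep_keywords) / sizeof(sleep_keywords[0]); i++)
        if (strcmp(sleep_keywords[i].keyword, keyword) == 0)
            return sleep_keywords[i].acpi_name;
    return NULL;
}
```

A platform with other sleep states would presumably add its own rows
rather than reuse these three.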

Systems don't need to support all of those, and can support other sleep states
as defined by the platform.  For example, some platforms have a "Deep Sleep"
that keeps main DRAM refreshed and acts like ACPI S3.  Others use that name
for a state like S4 (but without a disk; not "STD"!), where only a few pages
of SRAM and the wakeup devices are kept powered.

    [[ Not all systems that list "standby" in /sys/power/state can	]]
    [[ enter a standby state; likewise with "mem".  This would probably	]]
    [[ be fixed by having better platform-specific initialization;  	]]
    [[ that could sort out the "what does 'disk' mean" issue too.	]]

For all of those sleep states, drivers need to support an internal "freeze"
state for their devices, where it's idle with I/O disabled (including DMA and
IRQs).  From the hardware perspective, this resembles the state a device
will be in after device_release_driver(), though it should usually not be
a low power mode.  The difference to software is that the device driver still
manages I/O requests (maybe by just blocking a queue), wakeup events, and
device-specific low power modes.  In many cases, the device must later
enter a low power mode before the system can sleep.
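
One way to picture the "freeze" state is as a flag in driver-private
data that stops requests from reaching the hardware while the driver
keeps accepting them.  The sketch below assumes a toy driver that simply
counts requests; "toy_dev" and its functions are hypothetical names, and
a real driver would also have to quiesce DMA and IRQs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical driver-private state; none of these names are kernel APIs. */
struct toy_dev {
    bool frozen;     /* true: no I/O (or DMA, or IRQs) reaches the hardware */
    bool low_power;  /* tracked separately: frozen need not mean low power */
    int  queued;     /* requests held back while frozen */
    int  issued;     /* requests actually passed to the hardware */
};

/* Freeze: stop issuing I/O, but keep accepting (and holding) requests. */
void toy_freeze(struct toy_dev *d)
{
    assert(d);
    d->frozen = true;
}

/* Requests still arrive while frozen; the driver just blocks its queue. */
void toy_submit(struct toy_dev *d)
{
    assert(d);
    if (d->frozen)
        d->queued++;
    else
        d->issued++;
}

/* Thaw: re-enable I/O and push out whatever was held back. */
void toy_thaw(struct toy_dev *d)
{
    assert(d);
    d->frozen = false;
    d->issued += d->queued;
    d->queued = 0;
}
```

Note that toy_freeze() never touches low_power, matching the point that
freezing is not itself a low power mode.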

    [[ There's the comment that kexec needs "freeze" states too...	]]

Different boards may treat similar devices very differently.  For example
during STR a given type of controller might stay powered on one system,
but be powered off on a different one.  One such controller might be able
to issue wakeup events, but not another.

1.2. Device PM
--------------
Device PM is what makes a device enter a low power mode.  The final choice
of what state to enter is made by the device driver, but sysfs supports
selective device suspend requests like "echo D3hot > .../power/state".

    [[ We might prefer to get rid of those files, treating these	]]
    [[ power modes as driver-internal.  We clearly can't continue	]]
    [[ having power/state be a u32; its semantics are NOT global!!	]]

The possible states are device-specific:

    + Some states may be standardized by the bus interface.  PCI devices
      may implement D1/D2/D3hot/D3cold low power modes. USB devices support
      only a single low power "suspend" state.

    + Device power mode is orthogonal to the "freeze" driver state.  Low
      power modes are not necessarily frozen, and vice versa.  Sometimes
      hardware specific optimizations are available, like PCI drivers knowing
      that all their low power modes are also frozen.

    + The device's function may define several power modes.  Some of these
      might be managed through user space software.  (Consider a device that
      manages settings on a power switch.)

    + A device might have several clocks and power rails with individual
      controls, allowing several distinct power modes.

    + Devices often share clocks with others on the same chip.  Some clocks
      can be turned off when all of the devices are inactive, saving more
      power.  (ARM Linux has clock APIs to help with these common cases.)

	[[ Those clock gating issues aren't specific to ARM Linux...	]]

    + Devices generally can't enter low power modes before their children,
      or leave them before their parents.  At least some of that logic might
      need to live in the bridge or hub drivers managing those relationships.

	[[ driver core doesn't ensure that sequencing through sysfs ...	]]

    + Devices may need to configure hardware wakeup signals before they enter
      low power modes (rather than after).  USB requires that; PCI doesn't.
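
The parent/child ordering rule in the list above amounts to a post-order
walk for suspend and a pre-order walk for resume.  A minimal sketch,
assuming a toy tree type ("node") that stands in for the real driver
model tree, and a log used only to record the visiting order:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_LOG      16

/* Hypothetical device node; this is not the kernel's struct device. */
struct node {
    const char *name;
    struct node *child[MAX_CHILDREN];
    int nchild;
};

/* Records the order in which nodes were visited. */
struct order_log {
    const char *name[MAX_LOG];
    int n;
};

void log_visit(struct order_log *log, const struct node *n)
{
    assert(log->n < MAX_LOG);
    log->name[log->n++] = n->name;
}

/* Suspend bottom-up: every child enters low power before its parent. */
void suspend_tree(const struct node *n, struct order_log *log)
{
    int i;

    for (i = 0; i < n->nchild; i++)
        suspend_tree(n->child[i], log);
    log_visit(log, n);
}

/* Resume top-down: a parent leaves low power before any of its children. */
void resume_tree(const struct node *n, struct order_log *log)
{
    int i;

    log_visit(log, n);
    for (i = 0; i < n->nchild; i++)
        resume_tree(n->child[i], log);
}
```

For a host controller with a hub and a mouse below it, suspend_tree()
visits mouse, hub, controller; resume_tree() visits them in the opposite
order.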

As with cpufreq, a driver might support power management policies, used on
its devices while the system is running (ACPI G0).  Today these are always
driver-internal; there's no current per-device sysfs support for examining
or changing these policies.

    + A "performance" mode that's not power-aware.  This is all that
      most current drivers support:  always-on.

    + Some drivers might want to support "userspace" management.  This 
      is often done as part of a userspace API to the device, such as
      DPMS screen blanking.

    + An "ondemand" mode using minimal power except when the device is in
      active use.  This could be invisible to other software, except that
      the device may sometimes take a bit more time to respond to requests.
      The driver's request queues are still active.

      Examples, each saving a few Watts of power:  A disk might spin down.
      A USB mouse might suspend itself until woken up, eliminating a source
      of periodic DMAs that prevents some x86 CPUs from entering C3 state.

    + Almost all drivers should be able to support an "off" policy.  They
      may already support some variant of it for hot-unplug cases, in the
      period between the unplug and when the bus driver finishes unbinding
      that driver from the device.  (Cardbus devices return ~0 for reads,
      USB requests get timeouts; and so on.)

When the system is going to sleep (ACPI G1), drivers are called from PM core
code to choose power modes compatible with the upcoming sleep state.
Those requests "trump" power modes previously chosen by Device PM policy
(for ACPI G0) while the system is asleep, and can sometimes use bus-specific
defaults to choose the new device state.
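
The way a system sleep request "trumps" a Device PM policy can be
sketched as a two-level lookup:  the policy picks a mode while the
system runs, and a sleep-time mode overrides it only while the system is
asleep.  Everything here ("pm_dev", the enums, effective_mode()) is a
hypothetical illustration, not an existing or proposed API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy names, loosely following the list in this section. */
enum dev_policy { POLICY_PERFORMANCE, POLICY_USERSPACE, POLICY_ONDEMAND,
                  POLICY_OFF };
enum dev_mode   { MODE_FULL_POWER, MODE_LOW_POWER, MODE_OFF };

struct pm_dev {
    enum dev_policy policy;
    bool busy;                /* used by the "ondemand" policy */
    enum dev_mode user_mode;  /* used by the "userspace" policy */
    bool system_asleep;       /* set by PM core around a sleep state */
    enum dev_mode sleep_mode; /* mode chosen for the upcoming sleep state */
};

/* Mode selected by Device PM policy while the system is running (G0). */
enum dev_mode policy_mode(const struct pm_dev *d)
{
    switch (d->policy) {
    case POLICY_PERFORMANCE: return MODE_FULL_POWER;
    case POLICY_USERSPACE:   return d->user_mode;
    case POLICY_ONDEMAND:    return d->busy ? MODE_FULL_POWER : MODE_LOW_POWER;
    case POLICY_OFF:         return MODE_OFF;
    }
    return MODE_FULL_POWER;
}

/* The sleep-time choice overrides the policy only while asleep (G1). */
enum dev_mode effective_mode(const struct pm_dev *d)
{
    assert(d);
    return d->system_asleep ? d->sleep_mode : policy_mode(d);
}
```

Because the policy itself is never modified, the device falls back to
its policy-selected mode as soon as system_asleep is cleared.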

2. OVERVIEW OF ENTERING SYSTEM SLEEP STATES
===========================================
This text is a bit biased towards STD cases ... but we really need it to
work for STR and Standby transitions (and others!) too.

  + Something initiates a transition, possibly from userspace.  Or a
    kernel timer might fire after being in Standby for a long time,
    waking the system just long enough to initiate Suspend-to-RAM.

  + A "freeze yourself" message is broadcast to many components.  This
    involves freezing tasks and drivers.  Device trees must freeze their
    drivers bottom-up.  It may be important to receive acknowledgements
    from userspace software (like an X server); some of the applications
    may need to veto sleep requests.  There may be additional sequencing
    requirements.

	[[ NOTE vetoing was part of APM, not consistently handled.	]]
	[[ The current framework doesn't seem to want vetoes ...	]]

    When this finishes, only one Linux CPU should be active, and filesystems
    should normally have flushed all dirty pages.

  + ONLY FOR STD -- a snapshot of the RAM image must be saved to persistent
    storage.  This means that

	* Currently, 50% of memory must be free to create the snapshot.

	* At least one (swap) device gets thawed, along with its ancestor
	  devices, top-down.

	* System timers probably need to be working, also DMA and the
	  relevant IRQ controllers.

	* The device performs enough I/O to make sure the image is saved
	  persistently.  (Example: it must be flushed from disk caches.)

	* Those devices presumably need to be re-frozen.

  + Drivers get told to enter power modes appropriate to the system sleep
    state.  Some busses could provide default mappings for those, into a
    safe device power mode.

    This MUST be done in a way that respects the physical topology of the
    hardware ... one of the original tasks of the 2.6 driver model work.
    You can only power down a PCI device before you power down its parent
    bridge.  You can only suspend a USB device before suspending its hub.
    (Violating these rules would mean you can't do I/O to the device
    as part of suspending it.)

    As part of this, drivers will tell many devices to enable wakeup
    (if they've not done so already for an "ondemand" Device PM policy).

  + Then the system itself enters the appropriate sleep state, using some
    platform-specific mechanism.  There are two basic routes:

	* Linux manages everything, given very detailed platform hardware
	  specifications (which might not be available, or stable).

	* Firmware ("BIOS" on PCs) handles at least part of it.  In some
	  cases this is good, like re-initializing graphics accelerators
	  or abstracting board-specific details.  But firmware bugs are
	  widespread; any given sleep state won't work on all boards.

    In most cases, Linux relies on firmware, either to enter the sleep
    state or (in one STD case) just to power down, and later power up.

  + System leaves the sleep state when some wakeup device asks it to do
    so; examples include power buttons, keyboards, mice, LAN, clock, etc.

  + System resumes.  The two basic paths here match how it entered the
    sleep state:  firmware may or may not be involved.  Typically it's
    involved.

    Device drivers get notified about system resume.  If their hardware was
    reset, the driver must reinitialize.  The driver's software state will
    sometimes have been restored from an STD snapshot, reflecting "freeze"
    and not matching the current hardware state; but for STR and Standby
    states, the driver's state can always reflect having entered the low
    power mode it's resuming from.

	[[ OK, this is still shy about most resume details.		]]
	[[ Some of them seem like they'll matter.			]]
	[[ Should "resume" differ from "thaw"?				]]
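
The overall sequence above can be summarized as an ordered list of
steps, with the snapshot steps present only for STD.  The sketch below
is one reading of that sequence; the step names and plan_sleep() are
invented for illustration and leave out vetoes and error handling.

```c
#include <assert.h>

enum sleep_kind { SLEEP_STANDBY, SLEEP_STR, SLEEP_STD };

/* Hypothetical step names following the overview in this section. */
enum sleep_step {
    STEP_FREEZE,            /* freeze tasks and drivers, bottom-up */
    STEP_THAW_SWAP,         /* STD only: thaw swap device and ancestors */
    STEP_WRITE_SNAPSHOT,    /* STD only: save RAM image persistently */
    STEP_REFREEZE,          /* STD only: freeze those devices again */
    STEP_DEVICES_LOW_POWER, /* drivers pick modes, enable wakeup */
    STEP_ENTER_SLEEP,       /* platform-specific mechanism */
    STEP_WAKEUP,            /* some wakeup device fires */
    STEP_RESUME             /* drivers notified, hardware reinitialized */
};

#define MAX_STEPS 8

/* Fill 'out' with the steps for this kind of sleep; return the count. */
int plan_sleep(enum sleep_kind kind, enum sleep_step *out, int max)
{
    int n = 0;

    assert(out && max >= MAX_STEPS);
    out[n++] = STEP_FREEZE;
    if (kind == SLEEP_STD) {
        out[n++] = STEP_THAW_SWAP;
        out[n++] = STEP_WRITE_SNAPSHOT;
        out[n++] = STEP_REFREEZE;
    }
    out[n++] = STEP_DEVICES_LOW_POWER;
    out[n++] = STEP_ENTER_SLEEP;
    out[n++] = STEP_WAKEUP;
    out[n++] = STEP_RESUME;
    return n;
}
```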

3. SYSTEM VS. DEVICE:  PM INTERACTIONS
======================================

  + When Device PM has put a device into a low power mode, it might be
    able to freeze, or enter a system sleep state, with no more work.
    Likewise, resuming from that system sleep state might not need to
    affect that device:  it might stay in that low power mode.

  + Some devices don't support low power modes, which need not be a
    problem for at least "standby" states.

  + Some devices support "wakeup" capabilities that let them transition
    the system from a sleep state (such as STR/S3 or even STD/S4bios) to
    operational state.

    Examples include network adapters with "Wake On Lan" (WOL) ability,
    keyboards and mice (both PS2 and USB), and certain timers.

  + Transitioning TO a system sleep state involves letting devices change
    to a compatible power mode.  For example, a PCI device with PCI PM
    capabilities can always enter PCI D3hot.  Drivers must choose though:

	* PCI-based USB host controllers will be fine with D3hot in all
	  cases, although D1 and D2 can be used in "standby" states.

	* The driver for a 3D graphics card might not support using
	  anything except D2 unless certain resume paths are followed
	  (re-initializing or reloading firmware, also needed for
	  various network adapters).

	* Certain PCI slots might be powered off in certain states,
	  while others might keep power.  (Board-specific.)

	* When power to a device is going to be turned off, there may
	  not be much to do except early cleanup (reducing the amount
	  of data that STD must save).

    So there need to be various platform-specific hooks, possibly exposing
    data to drivers using dev->platform_data or bus-specific calls like
    pci_get_pm_details(pdev).

  + Transitioning FROM a system sleep state involves some of the same
    issues on the other end.

	* Devices won't always resume into a full power mode; and full
	  power may be undesirable too.

	* Hotpluggable devices are commonly disconnected during sleep,
	  just like they're commonly disconnected during operation.

	* Selective re-activation is used when the hardware didn't discard
	  state.  For example, USB devices in suspend state won't change
	  configuration or discard endpoint state, so there'd be no point
	  in any re-initialization.

	* Sometimes returning to operational states involves complete
	  re-initialization of the hardware.  This is typically the case
	  when the device lost power, and was thus reset.

	* For sleep states that completely power down the system (like the
	  default "swsusp" STD state, but not necessarily the S4bios STD),
	  some controllers need to handle more than just the "device was
	  reset" case.  Boot firmware sometimes leaves controllers (such
	  as for USB keyboard support) in surprising states that have no
	  resemblance to a true hardware resume.
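
The resume cases listed above could be folded into a single decision
made by a driver when it is told the system has resumed.  This is only a
sketch of that decision; "resume_facts" and choose_resume_action() are
hypothetical, and how a driver would actually learn these facts is one
of the open questions in this document.

```c
#include <assert.h>
#include <stdbool.h>

enum resume_action {
    RESUME_DISCONNECT,  /* device went away while the system slept */
    RESUME_FULL_RESET,  /* boot firmware may have left it in an odd state */
    RESUME_REINIT,      /* device lost power and was reset */
    RESUME_REACTIVATE   /* hardware kept its state; selective re-activation */
};

/* Hypothetical inputs; a real driver would probe hardware to learn these. */
struct resume_facts {
    bool still_present; /* false if hot-unplugged during sleep */
    bool firmware_ran;  /* system fully powered down, boot firmware ran */
    bool power_lost;    /* device itself was unpowered */
};

enum resume_action choose_resume_action(const struct resume_facts *f)
{
    assert(f);
    if (!f->still_present)
        return RESUME_DISCONNECT;
    if (f->firmware_ran)
        return RESUME_FULL_RESET;
    if (f->power_lost)
        return RESUME_REINIT;
    return RESUME_REACTIVATE;
}
```

The checks run from most to least disruptive, so a device that both lost
power and was touched by boot firmware gets the stronger treatment.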

4. PATCHES, PATCHES, PATCHES
============================
Linux needs various patches to move closer to this model, including:

[[ NOT ALL OF THESE GOT DISCUSSED IN DETAIL!!			]]
[[ Many of the API sketches here are just "doodles"			]]

    + The driver/pm core needs to support wakeup events.

	--> Draft patches have been posted.  (David Brownell)

	    EXCEPT note that ACPI wakeup-enable isn't enabled by in-kernel
	    APIs, so for example pci_enable_wake() can't make sure that
	    the device is enabled (even if you could figure out which
	    PCI device corresponded to which ACPI device) ... and there's
	    no way for the i8042/serio drivers to enable/disable wakeup
	    support on suspend().

    + The PM core code needs to stop self-deadlocking when suspend() or
      resume() operations need to remove devices.

	--> Draft patches have been posted.  (Paul Mackerras).

    + Sometimes devices may need driver/pm core calls to suspend the whole
      tree rooted at that device.  Or do we want to make that work be the
      responsibility of those bridge or hub drivers?

    + Driver suspend() calls need to accept system-specific sleep states
      as parameters to their "suspend" calls:

	* These should be typed well enough to give compiler errors (not
	  "sparse" ones!) for drivers that expect "u32" etc ... at least
	  if we continue to call these suspend() and resume().  (Platform
	  devices would need to change, but otherwise only pmcore code
	  really makes those driver model calls.)

	    --> Still under discussion, quite possibly /sys/power could
		be built from a vector of these:

		struct system_sleep_state {
		    char *name;
		    void *platform_data;
		};

		/* example system states might be:
		 * ACPI_S1 "standby",
		 * ACPI_S3 "mem",
		 * ACPI_S4bios "disk",
		 * ACPI_S5 "disk",	(typical swsusp)
		 * ACPI_S5 "reboot",
		 * ACPI_S5 "powerdown",
		 * ACPI_S5 "halt",
		 */

		/* called as dev->driver->suspend(dev, state) etc */
		int (*suspend)(struct device *dev,
				struct system_sleep_state *state);
		int (*resume)(struct device *dev);

	    -->	There's some sentiment that resume() might need some
		kind of parameter.  In which case we might not need a
		separate suspend() method; the parameter might indicate
		"resume from S4bios STD" or "resume from S5 STD"

	    -->	There's also the notion that the parameter to suspend()
		continue to be some sort of enum or bitmask, perhaps a
		message something like

		struct device_pm_foo {
		    enum { FREEZE, SUSPEND, RESUME } verb;
		    int tbd_flags;	/* std phase, "gonna STR", etc */
		};

	* Drivers need to be able to choose the actual power transitions
	  they'll use.  Most drivers won't want to care, but some will
	  need access to board-specific information like:

	    - Will this device be powered down in the target sleep state?
	      This will always be true for certain STD variants.

	    - Will certain clocks be removed?  For example, some sleep
	      states might preclude leaving a 48MHz clock active.

	    - Will its firmware need to be reloaded?  By Linux?

	    - Will system memory be powered down?

	  Enabling wakeup can also be platform/board/chip-specific; PCI is
	  probably the exception in having a standard way to enable it
	  (update bits in PM caps with pci_enable_wake).

	    --> Still under discussion.

	        That data could come through platform_data in the device
		or system_sleep_state, or through "helper" routines like
		pci_get_pm_details().  Some state might be available only
		indirectly through those mechanisms.

		pm_is_device_powered(struct system_sleep_state *state,
				struct device *dev);
		pm_is_clocked(struct system_sleep_state *state,
				char *clock_name);
		pm_is_snapshotted(struct system_sleep_state *state);

	* PCI drivers are a particular problem, since there are so many
	  of them (not all in-kernel) using the original LK 2.4 PCI API
	  which specifies power modes (D1/D2/D3/...) to drivers rather
	  than expecting a policy-based choice.

	    --> Still under discussion; it'd be nice to avoid needing
		to change every PCI driver in Linux.

		One possible approach for PCI drivers continues supporting
		PCI calls (with typechecking updates!) but adding a new
		API allowing drivers to provide their own policies.

		The PCI bus driver model code would provide a default
		implementation of that new call that maps system sleep
		states into PCI power modes, then calls the old PCI API.
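
The PCI approach described just above might look roughly like this:
bus code asks the driver for a policy-based choice if it provides one,
falls back to a default mapping otherwise, and then calls the old-style
suspend routine with the resulting PCI power mode.  All names below
except "system_sleep_state" are hypothetical, and the default of D3hot
is an assumption based on the earlier remark that devices with PCI PM
capabilities can always enter it.

```c
#include <assert.h>
#include <stddef.h>

/* As sketched earlier in this section (const added here). */
struct system_sleep_state {
    const char *name;
    void *platform_data;
};

enum pci_mode { PCI_D0, PCI_D1, PCI_D2, PCI_D3HOT };

/* Hypothetical driver ops: new optional policy hook plus old-style call. */
struct toy_pci_driver {
    enum pci_mode (*choose_mode)(const struct system_sleep_state *state);
    int (*suspend)(enum pci_mode mode);  /* stands in for the old PCI API */
};

/* Default mapping supplied by the bus code; an assumption, see above. */
enum pci_mode toy_pci_default_mode(const struct system_sleep_state *state)
{
    (void)state;
    return PCI_D3HOT;
}

/* Bus-level suspend: pick a mode, then call the old API with it. */
int toy_pci_bus_suspend(const struct toy_pci_driver *drv,
                        const struct system_sleep_state *state)
{
    enum pci_mode mode;

    assert(drv && drv->suspend && state);
    mode = drv->choose_mode ? drv->choose_mode(state)
                            : toy_pci_default_mode(state);
    return drv->suspend(mode);
}

/* Two example drivers sharing one old-style suspend routine. */
enum pci_mode last_mode = PCI_D0;

int legacy_suspend(enum pci_mode mode)
{
    last_mode = mode;  /* a real driver would program the hardware here */
    return 0;
}

/* A graphics-like driver that won't go deeper than D2. */
enum pci_mode gfx_choose_mode(const struct system_sleep_state *state)
{
    (void)state;
    return PCI_D2;
}
```

Unmodified drivers leave choose_mode NULL and keep working, which is the
point of the proposal:  only drivers with special needs have to change.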

    + We need a way to freeze then thaw a device, and maybe ways to
      see if a device is currently frozen.

	--> One option:  set_freeze_state(dev, value).

	    Another: make this a system_sleep_state, but one that's
	    not accessible through /sys/power/state.

    + The driver/pm core needs to get rid of dev->power.power_state as a
      globally meaningful notion of device mode.  There is no such notion
      that's meaningful for all devices; even "on" has variations!  This
      is primarily useful for "echo -n 3 > .../power/state" support, and
      that mechanism may need to exist -- but any values should identify
      device-specific power modes.

	--> Nothing yet drafted.  Not clear that we want sysfs writes at 
	    all, much less supporting anything beyond some generic modes.
	    Maybe reading it should just return a driver-specified string,
	    and writing should be disallowed.

	--> Complete removal of dev->power.power_state has plenty of
	    fans.  But I do like the notion of sysfs reporting that
	    a device is in "PCI_D2" or "USB_SUSPEND".

    + For PCI devices, pdev->current_state should probably be read from
      the device config space.  And the offset of the pm capability should
      probably be cached, eliminating lots of repetitive lookups.
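
For that last item, reading the power state from config space and
caching the PM capability offset could be sketched as below.  The
register offsets follow the PCI and PCI Power Management specs; the
"toy_pci_dev" type and its byte-array config space are stand-ins so the
example is self-contained, where real code would use the PCI config
accessors.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS              0x06  /* low byte of the status register */
#define PCI_STATUS_CAP_LIST     0x10  /* device has a capability list */
#define PCI_CAPABILITY_LIST     0x34  /* offset of first capability */
#define PCI_CAP_ID_PM           0x01  /* power management capability ID */
#define PCI_PM_CTRL             4     /* PMCSR offset inside the PM cap */
#define PCI_PM_CTRL_STATE_MASK  0x03  /* 0..3 mean D0, D1, D2, D3hot */

/* Hypothetical device: simulated config space plus the cached offset. */
struct toy_pci_dev {
    uint8_t config[256];
    int pm_cap;  /* cached PM capability offset; 0 means not found yet */
};

/* Walk the capability list once; later calls return the cached offset. */
int toy_find_pm_cap(struct toy_pci_dev *pdev)
{
    uint8_t pos;
    int ttl = 48;  /* guard against malformed, looping lists */

    assert(pdev);
    if (pdev->pm_cap)
        return pdev->pm_cap;
    if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;
    pos = pdev->config[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && pos <= 0xf8 && ttl--) {
        if (pdev->config[pos] == PCI_CAP_ID_PM) {
            pdev->pm_cap = pos;
            return pos;
        }
        pos = pdev->config[pos + 1] & 0xfc;  /* next-capability pointer */
    }
    return 0;
}

/* Read the current D-state from the device, or -1 without PCI PM. */
int toy_read_power_state(struct toy_pci_dev *pdev)
{
    int pm = toy_find_pm_cap(pdev);

    if (!pm)
        return -1;
    return pdev->config[pm + PCI_PM_CTRL] & PCI_PM_CTRL_STATE_MASK;
}
```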

[[ Worth doing:  compare with other operating systems: Darwin, Symbian etc.	]]

[[ Also worth doing:  start ripping out more "legacy" Linux PM code, there	]]
[[ are several models and they don't all play well together ...			]]

ACKNOWLEDGEMENTS
================
Many people have contributed to Linux power management infrastructure over
the years.  This particular document incorporates ideas that came from
discussions including notably:

	David Brownell
	Nigel Cunningham
	Benjamin Herrenschmidt
	Pavel Machek
	Paul Mackerras
	Patrick Mochel
	Todd Poynor
	Alan Stern