[linux-pm] Some thoughts on suspend/resume development

stern at rowland.harvard.edu (Alan Stern) · Sat Mar 5 07:37:47 2005

Now that 2.6.11 is out, we can start to address a number of important
issues for power management.  This overview discusses some of them,
namely the ones I have run across in my own work.

Up to now, the PM development effort has been concerned primarily with
system-wide sleep transitions, things like Suspend-To-RAM (STR) and
Suspend-To-Disk (STD).  (A more general, less PC-centric description
would call these states "shallow sleep" and "deep sleep".  A third
possible state, which some people might be in favor of, is Standby or
"very shallow sleep".)  The important thing here is that these involve
global transitions, affecting every device in the system.  Also, they
don't involve many policy decisions (which clocks to turn off and so
on), and in any case such decisions are outside the scope of the PM
core.

The implementation of sleep transitions has gone reasonably smoothly,
and people generally understand the problems and what still needs to
be done.  There's one thing I'd like to mention: the refrigerator.
Ben H. has said that putting processes in the refrigerator at the
start of a sleep transition is an unnecessary waste of time, and I
tend to agree, with a couple of provisos:

	In SMP systems, things would quickly get very confusing if
	more than one processor was actively running during a sleep
	transition.  I don't know how this is handled currently, but
	presumably it's not a major problem.

	In STD, it's important that most processes do not run at all
	once the memory image has been captured.  In particular they
	must not run while the image is being stored on the disk.
	But sometimes a few processes have to be allowed to run (e.g.,
	those needed for performing disk I/O).  The scheduler must be
	told that only processes marked PF_NOFREEZE can be allowed to
	run.

If those two things are handled correctly, I don't see any need for
the refrigerator.  Maybe someone can point out reasons that I haven't
thought of.
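
To make the second proviso concrete, here is a minimal user-space
sketch of the check the scheduler would have to apply.  Everything in
it is a model: the task structure, the image_captured flag and the
flag value are assumptions made for illustration, and only the name
PF_NOFREEZE is taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder value for this sketch; the kernel defines its own. */
#define PF_NOFREEZE 0x1u

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task {
	const char *name;
	unsigned int flags;
};

/* Assumed global: set once the STD memory image has been captured. */
static bool image_captured;

/*
 * May this task be scheduled right now?  Before the image is captured
 * everything may run; afterwards only tasks marked PF_NOFREEZE may.
 */
static bool may_run(const struct task *t)
{
	assert(t != NULL);
	if (!image_captured)
		return true;
	return (t->flags & PF_NOFREEZE) != 0;
}
```

With a test like this in the scheduler's selection path, ordinary
processes simply never get picked while the image is being written,
and nothing has to be sent to the refrigerator beforehand.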

Now it's time to consider how to implement additional power-saving
measures -- in other words, selective suspends.  Support currently is
minimal and there are several important matters to settle.

Typically selective suspend comes in two forms: drivers automatically
reducing power usage by a device after a period of inactivity (let's
call this "auto suspend"), and userspace-initiated changes carried out
by writing to /sys/devices/.../power/state ("user suspend").  They are
rather different from system-wide sleep transitions, and each has its
own set of problems.

A common problem for all selective suspends is that, unlike system
sleeps, they can occur at any time.  Drivers will get very confused
unless we can guarantee somehow that, at a minimum, they will not
receive a suspend or resume call for a device while its probe or
release routine is running.  Some general sort of mutual exclusion is
needed, something more than just dpm_sem.  (Individual subsystems may
have even stronger requirements for mutual exclusion.  For example,
the USB drivers require that changing device configurations and
performing port resets be exclusive with each other and with both
suspend/resume and probe/release.)  I rather think that the need for
this will be so widespread that it deserves to be integrated into the
driver model core.
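
As a rough illustration of the kind of exclusion I mean, here is a
user-space model with one lock per device.  The structure and function
names are hypothetical, and for simplicity the sketch fails with -EBUSY
where a real implementation would more likely sleep on a semaphore.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-device state; not the real struct device. */
struct model_device {
	atomic_flag busy;	/* held during probe/release/suspend/resume */
	int bound;		/* a driver is bound to the device */
	int suspended;
};

#define MODEL_DEVICE_INIT { ATOMIC_FLAG_INIT, 0, 0 }

/* Try to take the per-device lock; -EBUSY if someone else holds it. */
static int dev_trylock(struct model_device *dev)
{
	assert(dev != NULL);
	return atomic_flag_test_and_set(&dev->busy) ? -EBUSY : 0;
}

static void dev_unlock(struct model_device *dev)
{
	atomic_flag_clear(&dev->busy);
}

/* Probe is split in two here so the test can model "probe in progress". */
static int core_probe_begin(struct model_device *dev)
{
	return dev_trylock(dev);
}

static void core_probe_end(struct model_device *dev)
{
	dev->bound = 1;
	dev_unlock(dev);
}

/* The driver's suspend method would only ever be called under the lock. */
static int core_suspend(struct model_device *dev)
{
	int ret = dev_trylock(dev);

	if (ret)
		return ret;
	if (dev->bound)
		dev->suspended = 1;
	dev_unlock(dev);
	return 0;
}
```

The point is only that the lock is taken by the core, around every
probe, release, suspend and resume call, so that individual drivers
don't each have to invent their own scheme.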

An important difference between system sleep and selective suspend is
that with selective suspend, we generally expect the device to resume
on demand.  This demand may take the form of a request to the driver
(e.g., a block I/O request for a disk device) or a resume request from
the device itself (e.g., a notification from a mouse that has just
been moved).  This means that input queues must not be plugged and
device interrupts must remain enabled, exactly the opposite of what
happens during system sleep.  For this reason it is vital for drivers
to know whether a suspend call is invoking a system sleep or a
selective suspend.  Hence I propose that a new pm_message_t event code,
PMSG_SELECTIVE (or maybe PMSG_SELECTIVE_SUSPEND), be used for selective
suspends.
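
Here is a small sketch of how a driver's suspend method might branch on
such a code.  It is a user-space model, not kernel code: the enum, the
state structure and the MODEL_ prefixed names are all made up for the
example, and only the idea of a distinct "selective" event comes from
the proposal above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event codes standing in for pm_message_t values. */
enum model_pm_event {
	MODEL_PMSG_SYSTEM_SLEEP,
	MODEL_PMSG_SELECTIVE,
};

/* Hypothetical driver-private state. */
struct model_dev_state {
	bool queue_plugged;	/* input queue blocked */
	bool irq_enabled;	/* device interrupts enabled */
	bool suspended;
};

/*
 * System sleep: plug the queue and disable interrupts.
 * Selective suspend: leave both alone so the device can resume on
 * demand, whether the demand comes from a request or from the device.
 */
static void model_suspend(struct model_dev_state *st, enum model_pm_event ev)
{
	bool selective = (ev == MODEL_PMSG_SELECTIVE);

	assert(st != NULL);
	st->queue_plugged = !selective;
	st->irq_enabled = selective;
	st->suspended = true;
}
```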

With resume-on-demand implemented properly, a driver may decide that
it can suspend its device without bothering to suspend the device's
children.  This kind of decision should be left to individual drivers
and the PM core shouldn't try to enforce a "children must be suspended
before their parents" policy for selective suspends.

A common problem facing all drivers that do auto suspend is how to set
the inactivity timeout.  Two possible answers are: add an attribute
file in the /sys/.../power directory (so different devices can have
different timeouts), or add a driver module parameter (so all devices
using the same driver will have the same timeout).  The module
parameter approach is more efficient, but it suffers from the drawback
that a driver is not notified when a parameter is changed!  So how
should we handle the situation where a user decreases the timeout
value?  The case where the timeout value is _increased_ poses no
problem; when the original timer expires the driver can figure out
that it's not yet time to do the suspend.
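
The "increased timeout" case can be sketched as follows.  This is a
user-space model with time counted in arbitrary ticks; the structure
and function names are hypothetical.  It deliberately does nothing
about the "decreased timeout" case, which remains the open question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical auto-suspend bookkeeping for one device. */
struct autosuspend {
	unsigned long last_activity;	/* tick of the last I/O */
	unsigned long timeout;		/* may change behind the driver's back */
	int suspended;
};

/*
 * Called when the inactivity timer fires.  Re-reads the timeout rather
 * than trusting the value in force when the timer was armed.  Returns
 * 0 if the device was suspended, otherwise the number of ticks for
 * which the timer should be re-armed.
 */
static unsigned long autosuspend_timer_expired(struct autosuspend *as,
					       unsigned long now)
{
	unsigned long idle;

	assert(as != NULL);
	idle = now - as->last_activity;	/* unsigned math survives wraparound */
	if (idle >= as->timeout) {
		as->suspended = 1;
		return 0;
	}
	return as->timeout - idle;
}
```

If the timer was armed for 10 ticks and the timeout is raised to 30 in
the meantime, the handler sees that only 10 ticks of idleness have
elapsed and asks to be re-armed for the remaining 20.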

For user suspends (made through sysfs) the user may want to convey
arbitrary information to a driver, things like which clocks to turn
off, which power level to change to, and so on.  This information
will vary from driver to driver, and the PM core shouldn't even try to
impose any sort of structure on it.  I think the best approach will be
to pass to the driver a character pointer giving the data written to
/sys/.../power/state, so that users can send whatever they want just
by writing it to the file.  This means adding an additional field to
pm_message_t.
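
A sketch of what that might look like, again as a user-space model.
The structure layout, the field name "data", the event constant and
the "level=N" syntax are all assumptions for the sake of the example;
the PM core would pass the string through untouched and each driver
would define its own syntax.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical extended message; not the existing pm_message_t. */
struct model_pm_message {
	int event;
	const char *data;	/* text written to power/state, or NULL */
};

#define MODEL_EVENT_USER_SUSPEND 1

/*
 * Example driver that happens to understand "level=N", N in 0..3.
 * Anything else is rejected; the core never interprets the string.
 */
static int model_driver_suspend(const struct model_pm_message *msg,
				int *level)
{
	const char *num;
	char *end;
	long v;

	assert(msg != NULL && level != NULL);
	if (msg->data == NULL || strncmp(msg->data, "level=", 6) != 0)
		return -EINVAL;
	num = msg->data + 6;
	v = strtol(num, &end, 10);
	if (end == num || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	if (v < 0 || v > 3)
		return -EINVAL;
	*level = (int)v;
	return 0;
}
```

A different driver could accept a list of clock names or anything else
it likes, which is exactly why the core shouldn't impose a format.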

Alan Stern