Re: [PATCH v3] Documentation/power: Update docs about suspend and CPU hotplug

"Rafael J. Wysocki" <rjw@xxxxxxx> · Thu, 20 Oct 2011 00:08:10 +0200

On Monday, October 17, 2011, Srivatsa S. Bhat wrote:
> Update the documentation about the interaction between the suspend (S3) call
> path and the CPU hotplug infrastructure.
> This patch focusses only on the activities of the freezer, cpu hotplug and
> the notifications involved. It outlines how regular CPU hotplug differs from
> the way it is invoked during suspend and also tries to explain the locking
> involved. In addition to that, it discusses the issue of microcode update
> during CPU hotplug operations.
> 
> v3:
>    * Split the diagram into two, in order to avoid giving the wrong notion
>      that this document explains the situation of CPU hotplug and suspend
>      running in parallel.
>    * Added a short description about CPU microcode update during CPU hotplug
>      operations in different scenarios.
>    * Added a section on known issues when CPU hotplug and suspend race with
>      each other.
> 
> v2:
>    * Clarified the question, to emphasize that the document explains only
>      the difference (and similarity) in the two code paths but not what
>      happens when race conditions occur between them.
> 
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>

Applied to linux-pm/linux-next.

Thanks,
Rafael

> ---
> 
>  Documentation/power/00-INDEX                   |    2 
>  Documentation/power/suspend-and-cpuhotplug.txt |  277 ++++++++++++++++++++++++
>  2 files changed, 279 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/power/suspend-and-cpuhotplug.txt
> 
> diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX
> index 45e9d4a..a4d682f 100644
> --- a/Documentation/power/00-INDEX
> +++ b/Documentation/power/00-INDEX
> @@ -26,6 +26,8 @@ s2ram.txt
>  	- How to get suspend to ram working (and debug it when it isn't)
>  states.txt
>  	- System power management states
> +suspend-and-cpuhotplug.txt
> +	- Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug
>  swsusp-and-swap-files.txt
>  	- Using swap files with software suspend (to disk)
>  swsusp-dmcrypt.txt
> diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt
> new file mode 100644
> index 0000000..a1acf1b
> --- /dev/null
> +++ b/Documentation/power/suspend-and-cpuhotplug.txt
> @@ -0,0 +1,277 @@
> +Interaction of Suspend code (S3) with the CPU hotplug infrastructure
> +
> +     (C) 2011 Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
> +
> +
> +I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
> +   infrastructure uses it internally? And where do they share common code?
> +
> +Well, a picture is worth a thousand words... So ASCII art follows :-)
> +
> +[This depicts the current design in the kernel, and focusses only on the
> +interactions involving the freezer and CPU hotplug and also tries to explain
> +the locking involved. It outlines the notifications involved as well.
> +But please note that here, only the call paths are illustrated, with the aim
> +of describing where they take different paths and where they share code.
> +What happens when regular CPU hotplug and Suspend-to-RAM race with each other
> +is not depicted here.]
> +
> +On a high level, the suspend-resume cycle goes like this:
> +
> +|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
> +|tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|
> +
> +
> +More details follow:
> +
> +                                Suspend call path
> +                                -----------------
> +
> +                                  Write 'mem' to
> +                                /sys/power/state
> +                                    sysfs file
> +                                        |
> +                                        v
> +                               Acquire pm_mutex lock
> +                                        |
> +                                        v
> +                             Send PM_SUSPEND_PREPARE
> +                                   notifications
> +                                        |
> +                                        v
> +                                   Freeze tasks
> +                                        |
> +                                        |
> +                                        v
> +                              disable_nonboot_cpus()
> +                                   /* start */
> +                                        |
> +                                        v
> +                            Acquire cpu_add_remove_lock
> +                                        |
> +                                        v
> +                             Iterate over CURRENTLY
> +                                   online CPUs
> +                                        |
> +                                        |
> +                                        |                ----------
> +                                        v                          | L
> +             ======>               _cpu_down()                     |
> +            |              [This takes cpuhotplug.lock             |
> +  Common    |               before taking down the CPU             |
> +   code     |               and releases it when done]             | O
> +            |            While it is at it, notifications          |
> +            |            are sent when notable events occur,       |
> +             ======>     by running all registered callbacks.      |
> +                                        |                          | O
> +                                        |                          |
> +                                        |                          |
> +                                        v                          |
> +                            Note down these cpus in                | P
> +                                frozen_cpus mask         ----------
> +                                        |
> +                                        v
> +                           Disable regular cpu hotplug
> +                        by setting cpu_hotplug_disabled=1
> +                                        |
> +                                        v
> +                            Release cpu_add_remove_lock
> +                                        |
> +                                        v
> +                       /* disable_nonboot_cpus() complete */
> +                                        |
> +                                        v
> +                                   Do suspend
> +
> +
> +
> +Resume proceeds likewise, with the counterparts being (in the order of
> +execution during resume):
> +* enable_nonboot_cpus() which involves:
> +   |  Acquire cpu_add_remove_lock
> +   |  Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
> +   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
> +   |  Release cpu_add_remove_lock
> +   v
> +
> +* thaw tasks
> +* send PM_POST_SUSPEND notifications
> +* Release pm_mutex lock.
> +
> +
> +It is to be noted here that the pm_mutex lock is acquired at the very
> +beginning, when we are just starting out to suspend, and then released only
> +after the entire cycle is complete (i.e., suspend + resume).
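
The suspend and resume steps quoted above can be sketched as a small
userspace C model. This is not kernel code: the bitmask representation,
the NR_CPUS value, the assumption that CPU 0 is the boot CPU, and the
model_* names are all made up for illustration, and locking and
notifications are left out.

```c
#include <assert.h>

#define NR_CPUS 4	/* assumed CPU count for this sketch */

/* Bit n set means CPU n is online (a stand-in for the kernel's cpumasks). */
static unsigned int online_mask = (1u << NR_CPUS) - 1;
static unsigned int frozen_cpus;	/* CPUs taken down by the suspend path */
static int cpu_hotplug_disabled;

/* Model of disable_nonboot_cpus(): offline every online CPU but the boot CPU. */
static void model_disable_nonboot_cpus(void)
{
	int cpu;

	assert(online_mask & 1u);	/* the boot CPU (assumed CPU 0) is online */
	frozen_cpus = 0;
	for (cpu = 1; cpu < NR_CPUS; cpu++) {
		if (!(online_mask & (1u << cpu)))
			continue;	/* was already offline: not restored later */
		online_mask &= ~(1u << cpu);	/* stands in for _cpu_down(cpu, 1) */
		frozen_cpus |= 1u << cpu;	/* note it down for resume */
	}
	cpu_hotplug_disabled = 1;	/* regular CPU hotplug is refused from now on */
}

/* Model of enable_nonboot_cpus(): bring back exactly the CPUs noted above. */
static void model_enable_nonboot_cpus(void)
{
	int cpu;

	cpu_hotplug_disabled = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (frozen_cpus & (1u << cpu))
			online_mask |= 1u << cpu;	/* stands in for _cpu_up(cpu, 1) */
	}
	frozen_cpus = 0;
}
```

The point of the frozen_cpus mask in this model is that a CPU which was
already offline before suspend stays offline after resume.
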
> +
> +
> +
> +                          Regular CPU hotplug call path
> +                          -----------------------------
> +
> +                                Write 0 (or 1) to
> +                       /sys/devices/system/cpu/cpu*/online
> +                                    sysfs file
> +                                        |
> +                                        |
> +                                        v
> +                                    cpu_down()
> +                                        |
> +                                        v
> +                           Acquire cpu_add_remove_lock
> +                                        |
> +                                        v
> +                          If cpu_hotplug_disabled is 1
> +                                return gracefully
> +                                        |
> +                                        |
> +                                        v
> +             ======>                _cpu_down()
> +            |              [This takes cpuhotplug.lock
> +  Common    |               before taking down the CPU
> +   code     |               and releases it when done]
> +            |            While it is at it, notifications
> +            |           are sent when notable events occur,
> +             ======>    by running all registered callbacks.
> +                                        |
> +                                        |
> +                                        v
> +                          Release cpu_add_remove_lock
> +                               [That's it!, for
> +                              regular CPU hotplug]
> +
> +
> +
> +So, as can be seen from the two diagrams (the parts marked as "Common code"),
> +regular CPU hotplug and the suspend code path converge at the _cpu_down() and
> +_cpu_up() functions. They differ in the arguments passed to these functions,
> +in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
> +argument. But during suspend, since the tasks are already frozen by the time
> +the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
> +with the 'tasks_frozen' argument set to 1.
> +[See below for some known issues regarding this.]
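
A minimal sketch of how the 'tasks_frozen' argument could select the
notification that callbacks see, again as a userspace model rather than
kernel code. The event values are modeled on the kernel's notifier
definitions of this era and should be treated as assumptions here, as
should the -EBUSY return for the "return gracefully" step; last_event and
the model_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Event values modeled on the kernel headers; assumptions for this sketch. */
#define CPU_ONLINE		0x0002
#define CPU_DEAD		0x0007
#define CPU_TASKS_FROZEN	0x0010
#define CPU_ONLINE_FROZEN	(CPU_ONLINE | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)

static int cpu_hotplug_disabled;
static unsigned long last_event;	/* hypothetical: what callbacks would receive */

/* Model of _cpu_down(): the common code reached by both call paths. */
static int model_cpu_down_common(unsigned int cpu, int tasks_frozen)
{
	unsigned long mod;

	assert(tasks_frozen == 0 || tasks_frozen == 1);
	mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
	(void)cpu;			/* the CPU number is not needed by the model */
	last_event = CPU_DEAD | mod;	/* "notification" sent once the CPU is gone */
	return 0;
}

/* Model of cpu_down(): the regular hotplug entry point behind the sysfs file. */
static int model_cpu_down(unsigned int cpu)
{
	if (cpu_hotplug_disabled)
		return -EBUSY;		/* suspend in progress: back off */
	return model_cpu_down_common(cpu, 0);	/* always tasks_frozen = 0 */
}
```

In this model the suspend path would call model_cpu_down_common(cpu, 1)
directly, which is the only way a *_FROZEN event is produced.
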
> +
> +
> +Important files and functions/entry points:
> +------------------------------------------
> +
> +kernel/power/process.c : freeze_processes(), thaw_processes()
> +kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
> +kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
> +
> +
> +
> +II. What are the issues involved in CPU hotplug?
> +    -------------------------------------------
> +
> +There are some interesting situations involving CPU hotplug and microcode
> +update on the CPUs, as discussed below:
> +
> +[Please bear in mind that the kernel requests the microcode images from
> +userspace, using the request_firmware() function defined in
> +drivers/base/firmware_class.c]
> +
> +
> +a. When all the CPUs are identical:
> +
> +   This is the most common situation and it is quite straightforward: we want
> +   to apply the same microcode revision to each of the CPUs.
> +   To give an example of x86, the collect_cpu_info() function defined in
> +   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
> +   and thereby in applying the correct microcode revision to it.
> +   But note that the kernel does not maintain a common microcode image for
> +   all the CPUs, in order to handle case 'b' described below.
> +
> +
> +b. When some of the CPUs are different than the rest:
> +
> +   In this case since we probably need to apply different microcode revisions
> +   to different CPUs, the kernel maintains a copy of the correct microcode
> +   image for each CPU (after appropriate CPU type/model discovery using
> +   functions such as collect_cpu_info()).
> +
> +
> +c. When a CPU is physically hot-unplugged and a new (and possibly different
> +   type of) CPU is hot-plugged into the system:
> +
> +   In the current design of the kernel, whenever a CPU is taken offline during
> +   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
> +   (which is sent by the CPU hotplug code), the microcode update driver's
> +   callback for that event reacts by freeing the kernel's copy of the
> +   microcode image for that CPU.
> +
> +   Hence, when a new CPU is brought online, since the kernel finds that it
> +   doesn't have the microcode image, it does the CPU type/model discovery
> +   afresh and then requests the appropriate microcode image for that CPU
> +   from userspace, which is subsequently applied.
> +
> +   For example, in x86, the mc_cpu_callback() function (which is the microcode
> +   update driver's callback registered for CPU hotplug events) calls
> +   microcode_update_cpu() which would call microcode_init_cpu() in this case,
> +   instead of microcode_resume_cpu() when it finds that the kernel doesn't
> +   have a valid microcode image. This ensures that the CPU type/model
> +   discovery is performed and the right microcode is applied to the CPU after
> +   getting it from userspace.
> +
> +
> +d. Handling microcode update during suspend/hibernate:
> +
> +   Strictly speaking, during a CPU hotplug operation which does not involve
> +   physically removing or inserting CPUs, the CPUs are not actually powered
> +   off during a CPU offline. They are just put to the lowest C-states possible.
> +   Hence, in such a case, it is not really necessary to re-apply microcode
> +   when the CPUs are brought back online, since they wouldn't have lost the
> +   image during the CPU offline operation.
> +
> +   This is the usual scenario encountered during a resume after a suspend.
> +   However, in the case of hibernation, since all the CPUs are completely
> +   powered off, during restore it becomes necessary to apply the microcode
> +   images to all the CPUs.
> +
> +   [Note that we don't expect someone to physically pull out nodes and insert
> +   nodes with a different type of CPUs in-between a suspend-resume or a
> +   hibernate/restore cycle.]
> +
> +   In the current design of the kernel however, during a CPU offline operation
> +   as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification),
> +   the existing copy of microcode image in the kernel is not freed up.
> +   And during the CPU online operations (during resume/restore), since the
> +   kernel finds that it already has copies of the microcode images for all the
> +   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
> +   type/model and the need for validating whether the microcode revisions are
> +   right for the CPUs or not (due to the above assumption that physical CPU
> +   hotplug will not be done in-between suspend/resume or hibernate/restore
> +   cycles).
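
The difference between cases 'c' and 'd' can be illustrated with a small
userspace C model of the microcode driver's reaction to the two offline
events. This is only a sketch under assumptions: mc_image, enum mc_action
and the model_* functions are hypothetical names, the event values are
modeled on the kernel headers, and the real driver logic is more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Event values modeled on the kernel headers; assumptions for this sketch. */
#define CPU_DEAD		0x0007
#define CPU_TASKS_FROZEN	0x0010
#define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)
#define NR_CPUS			4

static void *mc_image[NR_CPUS];	/* hypothetical per-CPU cache of microcode images */

enum mc_action {
	MC_REAPPLY_CACHED,		/* roughly the microcode_resume_cpu() case */
	MC_DISCOVER_AND_REQUEST,	/* roughly the microcode_init_cpu() case */
};

/* Offline side: only a regular hotplug offline drops the cached image. */
static void model_mc_offline(unsigned int cpu, unsigned long event)
{
	assert(cpu < NR_CPUS);
	if (event == CPU_DEAD) {
		free(mc_image[cpu]);
		mc_image[cpu] = NULL;
	}
	/* CPU_DEAD_FROZEN (suspend/hibernate): the image is deliberately kept. */
}

/* Online side: a missing image forces rediscovery and a userspace request. */
static enum mc_action model_mc_online(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return mc_image[cpu] ? MC_REAPPLY_CACHED : MC_DISCOVER_AND_REQUEST;
}
```

Keeping the image across CPU_DEAD_FROZEN is what lets resume avoid a
request to userspace while tasks are still frozen.
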
> +
> +
> +III. Are there any known problems when regular CPU hotplug and suspend race
> +     with each other?
> +
> +Yes, they are listed below:
> +
> +1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
> +   the _cpu_down() and _cpu_up() functions is *always* 0.
> +   This might not reflect the true current state of the system, since the
> +   tasks could have been frozen by an out-of-band event such as a suspend
> +   operation in progress. Hence, it will lead to wrong notifications being
> +   sent during the cpu online/offline events (eg, CPU_ONLINE notification
> +   instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
> +   inappropriate code by the callbacks registered for such CPU hotplug events.
> +
> +2. If a regular CPU hotplug stress test happens to race with the freezer due
> +   to a suspend operation in progress at the same time, then we could hit the
> +   situation described below:
> +
> +    * A regular cpu online operation continues its journey from userspace
> +      into the kernel, since the freezing has not yet begun.
> +    * Then freezer gets to work and freezes userspace.
> +    * If cpu online has not yet completed the microcode update stuff by now,
> +      it will now start waiting on the frozen userspace in the
> +      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
> +    * Now the freezer continues and tries to freeze the remaining tasks. But
> +      due to this wait mentioned above, the freezer won't be able to freeze
> +      the cpu online hotplug task and hence freezing of tasks fails.
> +
> +   As a result of this task freezing failure, the suspend operation gets
> +   aborted.
> +
> +
