[LSF/MM/BPF Topic] Energy-Efficient I/O

Bart Van Assche <bvanassche@xxxxxxx> · Mon, 27 Jan 2025 14:34:35 -0800

Energy efficiency is very important for battery-powered devices like
smartphones. In battery-powered devices, CPU cores and peripherals
support multiple power states. A lower power state is entered if no work
is pending. Typically, the deeper the power-saving state, the longer it
takes to exit that state.

Switching to a lower power state if no work is pending works well for
CPU-intensive tasks but is not optimal for latency-sensitive tasks like
block I/O with a low queue depth. If a CPU core transitions to a lower
power state after each I/O has been submitted and has to be woken up
every time an I/O completes, this can increase I/O latency
significantly. The cpu_latency_qos_update_request(..., max_latency)
function can be used to specify a maximum wakeup latency and hence can
be used to prevent a transition to a lower power state before an I/O
completes. However, cpu_latency_qos_update_request() is too expensive to
be called from the I/O submission path for every request.
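As a rough illustration of why a maximum wakeup latency keeps a CPU core
out of the deeper power states, here is a small user-space model. This is
not kernel code: the state names, the exit latencies and the
deepest_allowed_state() helper are made up for this example, and a real
cpuidle governor also considers other inputs such as the expected idle
duration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle state description; the numbers are illustrative only. */
struct idle_state {
	const char *name;
	uint32_t exit_latency_us;	/* time needed to leave this state */
};

/* Ordered from shallowest (index 0) to deepest. */
static const struct idle_state idle_states[] = {
	{ "clock-gated",       1 },
	{ "core-retention",  100 },
	{ "core-off",       1500 },
};

#define NR_IDLE_STATES (sizeof(idle_states) / sizeof(idle_states[0]))

/*
 * Return the index of the deepest idle state whose exit latency does not
 * exceed max_latency_us. State 0 is always allowed in this model.
 */
static size_t deepest_allowed_state(uint32_t max_latency_us)
{
	size_t best = 0;

	for (size_t i = 1; i < NR_IDLE_STATES; i++) {
		/* The table must be sorted by increasing exit latency. */
		assert(idle_states[i - 1].exit_latency_us <=
		       idle_states[i].exit_latency_us);
		if (idle_states[i].exit_latency_us <= max_latency_us)
			best = i;
	}
	return best;
}
```

With this table, a limit of 100 us still allows "core-retention" but rules
out "core-off", so a pending I/O completion is never delayed by the
1500 us exit latency of the deepest state.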

In the UFS driver, cpu_latency_qos_update_request() is called from
the devfreq_dev_profile::target() callback. That callback checks the
hba->clk_scaling.active_reqs variable, a variable that tracks the number
of outstanding commands. Updates of that variable are protected by a
spinlock and hence are a contention point. Having to maintain this or a
similar infrastructure in every block driver is not ideal.

A possible solution is to tie QoS updates to the runtime power
management (RPM) mechanism. The block layer interacts as follows with
the RPM mechanism:
* pm_runtime_mark_last_busy(dev) is called by the block layer upon
  request completion. This call updates dev->power.last_busy. The RPM
  mechanism uses this information to decide when to check whether a
  block device can be suspended.
* pm_request_resume() is called by the block layer if a block device has
  been runtime suspended and needs to be resumed.
* If the RPM timer expires, the block driver's .runtime_suspend() callback
  is invoked. The .runtime_suspend() callback is expected to call
  blk_pre_runtime_suspend() and blk_post_runtime_suspend().
  blk_pre_runtime_suspend() checks whether q->q_usage_counter is zero.
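The interaction listed above can be sketched as a user-space model. This is
not kernel code and makes several simplifying assumptions: in_flight is a
plain counter standing in for q->q_usage_counter, the resume is synchronous
whereas pm_request_resume() is asynchronous, and all rpm_* names are
hypothetical. The cpu_latency_limited flag shows one possible reading of
"tying QoS updates to RPM": the constraint would change only on
suspend/resume transitions, not per request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space model of a runtime-PM-managed block device. */
struct rpm_model {
	uint64_t last_busy_ms;		/* models dev->power.last_busy */
	uint64_t autosuspend_delay_ms;	/* idle time before suspend is tried */
	unsigned int in_flight;		/* stand-in for q->q_usage_counter */
	bool suspended;
	bool cpu_latency_limited;	/* ASSUMPTION: QoS follows RPM state */
	unsigned int qos_updates;	/* how often the QoS value changed */
};

static void rpm_init(struct rpm_model *d, uint64_t delay_ms)
{
	*d = (struct rpm_model){ .autosuspend_delay_ms = delay_ms,
				 .suspended = true };
}

/* Submission: resume first if needed (models pm_request_resume()). */
static void rpm_submit(struct rpm_model *d)
{
	if (d->suspended) {
		d->suspended = false;
		d->cpu_latency_limited = true;	/* one QoS update per resume */
		d->qos_updates++;
	}
	d->in_flight++;
}

/* Completion: models pm_runtime_mark_last_busy(). */
static void rpm_complete(struct rpm_model *d, uint64_t now_ms)
{
	assert(d->in_flight > 0);
	d->in_flight--;
	d->last_busy_ms = now_ms;
}

/*
 * RPM timer: models .runtime_suspend(). Returns true if the device was
 * suspended. The in_flight check mirrors the q->q_usage_counter check
 * done by blk_pre_runtime_suspend().
 */
static bool rpm_timer(struct rpm_model *d, uint64_t now_ms)
{
	if (d->suspended || d->in_flight != 0)
		return false;
	if (now_ms - d->last_busy_ms < d->autosuspend_delay_ms)
		return false;
	d->suspended = true;
	d->cpu_latency_limited = false;	/* one QoS update per suspend */
	d->qos_updates++;
	return true;
}
```

In this model a burst of requests costs a single QoS update at resume time
and a single one at suspend time, regardless of the number of requests in
the burst.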

It is not my goal to replace the iowait boost mechanism. This mechanism
boosts the CPU frequency when a task that is in the iowait state wakes
up after the I/O operation completes.

The purpose of this session is to discuss the following:
* A solution that exists in the block layer instead of in block drivers.
* A solution that does not cause contention between block layer hardware
  queues.
* A solution that does not measurably increase the number of CPU cycles
  per I/O.
* A solution that does not require users to provide I/O latency
  estimates.

See also:
* https://www.kernel.org/doc/Documentation/power/pm_qos_interface.txt
* Tero Kristo, [PATCHv2 0/2] blk-mq: add CPU latency limit control,
  2024-10-18
  (https://lore.kernel.org/linux-block/20241018075416.436916-1-tero.kristo@xxxxxxxxxxxxxxx/).
* The cpu_latency_constraints definition in kernel/power/qos.c.

Thanks,

Bart.