Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 22/03/2024 18:08, Vincent Guittot wrote:
> Hi Christian,
Hi Vincent,
thanks for taking a look.

> 
> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@xxxxxxx> wrote:
>>
>> There is a feature inside of both schedutil and intel_pstate called
>> iowait boosting which tries to prevent selecting a low frequency
>> during IO workloads when it impacts throughput.
>> The feature is implemented by checking for task wakeups that have
>> the in_iowait flag set and boosting the CPU of the rq accordingly
>> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>>
>> The necessity of the feature is argued with the potentially low
>> utilization of a task that is frequently in_iowait (i.e. most of the
>> time not enqueued on any rq, so it cannot build up utilization).
>>
>> The RFC focuses on the schedutil implementation.
>> intel_pstate frequency selection isn't touched for now, suggestions are
>> very welcome.
>> Current schedutil iowait boosting has several issues:
>> 1. Boosting happens even in scenarios where it doesn't improve
>> throughput. [1]
>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>>  the actual utilization for task placement, but another CPU might be
>>  more energy-efficient at that capacity than the boosted one.
>>  b) When placing a non-IO task while a CPU is boosted, compute_energy()
>>  will not consider the (potentially 'free') boosted capacity, but the
>>  one it would have without the boost (since the boost is only applied
>>  in sugov).
>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>> in_iowait wakeups.
>> 4. The boost isn't associated with a task, it therefore isn't considered
>> for task placement, potentially missing out on higher capacity CPUs on
>> heterogeneous CPU topologies.
>> 5. The boost isn't associated with a task, it therefore lingers on the
>> rq even after the responsible task has migrated / stopped.
>> 6. The boost isn't associated with a task, it therefore needs to ramp
>> up again when migrated.
>> 7. Since schedutil doesn't know which task is getting woken up,
>> multiple unrelated in_iowait tasks might lead to boosting.
>>
>> We attempt to mitigate all of the above by reworking the way the
>> iowait boosting (io boosting from here on) works in two major ways:
>> - Carry the boost in task_struct, so it is a per-task attribute and
>> behaves similar to utilization of the task in some ways.
>> - Employ a counting-based tracking strategy that only boosts as long
>> as it sees benefits and returns to no boosting dynamically.
> 
> Thanks for working on improving IO boosting. I have started to read
> your patchset and have a few comments about your proposal:
> 
> The main one is that the io boosting decision should remain a cpufreq
> governor decision and so the io boosting value should be applied by
> the governor like in sugov_effective_cpu_perf() as an example instead
> of everywhere in the scheduler code
Moving it into the scheduler is meant to make the boost visible to EAS
(e.g. boosting a LITTLE to its highest OPP is often much less
energy-efficient than running a higher-capacity CPU at a lower OPP) and to
let the boost reach the higher capacities available on other CPUs, too.
I guess for you the first one is the more interesting one.
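To put a (made-up) number on that first point: say a LITTLE CPU delivers
capacity 430 at its top OPP for 250mW, while a big CPU delivers capacity
700 at a mid OPP for 300mW. That is roughly 0.58mW vs 0.43mW per unit of
capacity, so even though the big CPU draws more power in absolute terms,
serving the boosted request there can be cheaper per delivered capacity.
feec() can only make that trade-off if it actually sees the boost.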

> 
> Then, the algorithm to track the right interval bucket and the mapping
> of intervals into utilization really looks like a policy which has
> been defined with heuristics and as a result further seems to be a
> governor decision

I did have a comparable thing as a governor decision, but the strategy of
"test whether the boost increases the iowaits seen per interval and only
keep boosting if it does" only works if the interval is long enough; my
proposed starting length of 25ms is really the lower limit for the storage
devices we want to cover (IO latency isn't constant, so the iowaits per
interval are somewhat noisy).
Given that IO tasks are enqueued/dequeued very frequently, it isn't
credible to expect them to land on the same CPU for many intervals, which
a per-CPU governor-side implementation would require, unless the system is
very bare-bones and idle. Even on an idle Android device I see any
interval above 50ms become unusable and not provide any throughput
improvement.
Tracking the iowaits is still the best option I see in this vague and
noisy environment of "iowait wakeups" and definitely worth having, so
that's why I opted for putting it in the scheduler code, but I'd love to
hear your thoughts/alternatives.
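
To make the per-interval tracking less abstract, here is a rough sketch of
the idea (heavily simplified, names and constants invented for the
illustration, not the actual patch code):

#define IO_BOOST_INTERVAL_NS	(25 * NSEC_PER_MSEC)	/* proposed starting interval */

struct io_boost_state {
	u64		interval_start;	/* begin of the current interval */
	unsigned int	iowaits_cur;	/* iowait wakeups seen in this interval */
	unsigned int	iowaits_prev;	/* iowait wakeups seen in the last interval */
	unsigned int	boost_level;	/* current boost step, 0 == no boost */
};

/* Called for every iowait wakeup of the task */
static void io_boost_account(struct io_boost_state *st, u64 now)
{
	st->iowaits_cur++;

	if (now - st->interval_start < IO_BOOST_INTERVAL_NS)
		return;

	/*
	 * Interval expired: only ramp the boost further if the higher
	 * frequency actually bought us more iowait wakeups per interval,
	 * i.e. more IO was completed; otherwise back off again.
	 */
	if (st->iowaits_cur > st->iowaits_prev)
		st->boost_level++;
	else if (st->boost_level)
		st->boost_level--;

	st->iowaits_prev = st->iowaits_cur;
	st->iowaits_cur = 0;
	st->interval_start = now;
}

The real interaction with sugov/EAS is of course more involved, this is
just the shape of the heuristic.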
I'd also like an improved definition of iowait, or a separate flag for
boostable IO; the current "boost on any iowait wakeup" is groping in the
dark, which is exactly what I'm trying to combat, but that's somewhat out
of scope here.

> 
> Finally adding some atomic operation in the fast path is not really desirable
Agreed, I'll look into it. For now I wanted as much feedback as possible
on the two major changes:
- iowait boost is now per-task
- boosting is based upon the iowaits seen per interval

> 
> I will continue to review your patchset

Thank you, I'm looking forward to your review.

>>[snip]
Kind Regards,
Christian



