(piggy backing on this reply)

On 03/22/24 19:08, Vincent Guittot wrote:
> Hi Christian,
>
> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@xxxxxxx> wrote:
> >
> > There is a feature inside of both schedutil and intel_pstate called
> > iowait boosting which tries to prevent selecting a low frequency
> > during IO workloads when it impacts throughput.
> > The feature is implemented by checking for task wakeups that have
> > the in_iowait flag set and boost the CPU of the rq accordingly
> > (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
> >
> > The necessity of the feature is argued with the potentially low
> > utilization of a task being frequently in_iowait (i.e. most of the
> > time not enqueued on any rq and cannot build up utilization).
> >
> > The RFC focuses on the schedutil implementation.
> > intel_pstate frequency selection isn't touched for now, suggestions are
> > very welcome.
> > Current schedutil iowait boosting has several issues:
> > 1. Boosting happens even in scenarios where it doesn't improve
> > throughput. [1]
> > 2. The boost is not accounted for in EAS: a) feec() will only consider
> > the actual utilization for task placement, but another CPU might be
> > more energy-efficient at that capacity than the boosted one.
> > b) When placing a non-IO task while a CPU is boosted compute_energy()
> > will not consider the (potentially 'free') boosted capacity, but the
> > one it would have without the boost (since the boost is only applied
> > in sugov).
> > 3. Actual IO heavy workloads are hardly distinguished from infrequent
> > in_iowait wakeups.
> > 4. The boost isn't associated with a task, it therefore isn't considered
> > for task placement, potentially missing out on higher capacity CPUs on
> > heterogeneous CPU topologies.
> > 5. The boost isn't associated with a task, it therefore lingers on the
> > rq even after the responsible task has migrated / stopped.
> > 6. The boost isn't associated with a task, it therefore needs to ramp
> > up again when migrated.
> > 7. Since schedutil doesn't know which task is getting woken up,
> > multiple unrelated in_iowait tasks might lead to boosting.

You forgot an important problem, which was the main request from Android
when this first came up a few years back. iowait boost is a power hungry
feature and not all tasks require iowait boost. By having it per task we
want to be able to prevent tasks from causing frequency spikes due to
iowait boost when it is not warranted.

> >
> > We attempt to mitigate all of the above by reworking the way the
> > iowait boosting (io boosting from here on) works in two major ways:
> > - Carry the boost in task_struct, so it is a per-task attribute and
> > behaves similar to utilization of the task in some ways.
> > - Employ a counting-based tracking strategy that only boosts as long
> > as it sees benefits and returns to no boosting dynamically.
>
> Thanks for working on improving IO boosting. I have started to read
> your patchset and have few comments about your proposal:
>
> The main one is that the io boosting decision should remain a cpufreq
> governor decision and so the io boosting value should be applied by
> the governor like in sugov_effective_cpu_perf() as an example instead
> of everywhere in the scheduler code.

I have similar thoughts. I think we want the scheduler to treat iowait
boost like uclamp_min, but requested by the block subsystem rather than
by the user.
I think we should create new task_min/max_perf() helpers and replace all
current callers of uclamp_eff_value() in the scheduler with
task_min/max_perf(), where:

	unsigned long task_min_perf(struct task_struct *p)
	{
		return max(uclamp_eff_value(p, UCLAMP_MIN), p->iowait_boost);
	}

	unsigned long task_max_perf(struct task_struct *p)
	{
		return uclamp_eff_value(p, UCLAMP_MAX);
	}

Then all users of uclamp_min in the scheduler will see the request for
boost from iowait and make the correct task placement decision,
including under thermal pressure, and while ensuring the boost doesn't
accidentally escape uclamp_max, which I am not sure your series caters
for with the open coding of it. You're missing the load balancer paths
from what I see.

It will also solve the problem I mention above. The tasks that should
not use iowait boost are likely restricted with uclamp_max already. If
we treat iowait boost as an additional source of min_perf request, then
uclamp_max will prevent it from going above a certain perf level and
give us the desired impact without any additional hint. I don't think it
is important to disable it completely, but rather to have a way to
prevent tasks from consuming too many resources when not needed, which
we already have from uclamp_max. I am not sure it makes sense to have a
separate control where a task can run fast due to util but can't have
iowait boost, or vice versa. I think the existing uclamp_max should be
enough to restrict tasks from exceeding a performance limit.

> Then, the algorithm to track the right interval bucket and the mapping
> of intervals into utilization really looks like a policy which has
> been defined with heuristics and as a result further seems to be a
> governor decision

Hmm, do you think this should not be a per-task value then, Vincent? Or
oh, I think I see what you mean. Make effective_cpu_util() set the min
parameter correctly. I think that would work too, yes.
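The aggregation above can be sketched as a self-contained userspace mock
(not kernel code; struct task_mock, the field names, and the values are
made up for illustration): iowait boost feeds into the min-perf side,
and uclamp_max still caps the effective request.

```c
#include <assert.h>

/* Userspace mock of the proposed per-task aggregation. */
struct task_mock {
	unsigned long uclamp_min;   /* stand-in for uclamp_eff_value(p, UCLAMP_MIN) */
	unsigned long uclamp_max;   /* stand-in for uclamp_eff_value(p, UCLAMP_MAX) */
	unsigned long iowait_boost; /* stand-in for the proposed p->iowait_boost */
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* iowait boost is just another source of min-perf request. */
static unsigned long task_min_perf(const struct task_mock *p)
{
	return max_ul(p->uclamp_min, p->iowait_boost);
}

static unsigned long task_max_perf(const struct task_mock *p)
{
	return p->uclamp_max;
}

/*
 * Effective perf request for a given utilization: raised to the min
 * side (which now includes the iowait boost), but never allowed to
 * escape uclamp_max.
 */
static unsigned long task_eff_perf(const struct task_mock *p,
				   unsigned long util)
{
	return min_ul(max_ul(util, task_min_perf(p)), task_max_perf(p));
}
```

For example, a low-utilization task with an iowait boost of 800 but
uclamp_max of 512 ends up requesting 512: the boost is honoured only up
to the performance limit the user already set, with no additional hint.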
iowait boost is just another min perf request, and as long as it is
treated as such, it is good for me. We'll just need to add a new
parameter for the task like I did in the remove uclamp max aggregation
series.

Generally I think it's better to split the patches so that the
conversion of iowait boost with the current algorithm to being per-task
is a separate patch, and then look at improving the algorithm logic on
top. These are two different problems IMHO.

One major problem and big difference with per-task iowait boost, which I
see Christian alluded to, is that the CPU will no longer be boosted when
the task is sleeping. I think there will be cases out there where some
users relied on that for the BLOCK softirq to run faster too. We need an
additional way to ensure that the softirq runs at a similar performance
level to the task that initiated the request. So we need a way to hold
the cpufreq policy's min perf until the softirq is serviced, or to just
keep the CPU boosted until the task is migrated. I'm not sure which is
better yet.

> Finally adding some atomic operation in the fast path is not really
> desirable

Yes. I was wondering whether we can apply the value when we set the
p->in_iowait flag instead?
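To make the governor-side point concrete: if the per-task boost is
folded into the min value handed down, the governor's job stays a plain
clamp, so the boost is automatically bounded by the max side. A
userspace sketch with made-up names (effective_cpu_perf, CAPACITY),
roughly mirroring what a sugov_effective_cpu_perf()-style helper does
rather than quoting its actual implementation:

```c
#include <assert.h>

#define CAPACITY 1024UL /* illustrative capacity scale */

/*
 * Governor-side decision: take the actual utilization, honour the min
 * request (which would now include the enqueued task's iowait boost),
 * and cap at max / CPU capacity. The policy of *how much* to boost
 * stays out of this helper; it only sees another min-perf request.
 */
static unsigned long effective_cpu_perf(unsigned long actual,
					unsigned long min,
					unsigned long max)
{
	unsigned long hi = max < CAPACITY ? max : CAPACITY;

	if (actual < min)
		actual = min;
	if (actual > hi)
		actual = hi;
	return actual;
}
```

With this shape, an iowait boost of 300 on an otherwise idle-ish CPU
raises the request to 300, while a boost on a task capped at max = 512
can never push the CPU past 512.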