Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"

Christian Loehle <christian.loehle@xxxxxxx> · Tue, 13 Aug 2024 17:20:25 +0100

On 8/9/24 01:23, Qais Yousef wrote:
> On 08/05/24 11:18, Christian Loehle wrote:
> 
>>> My understanding of rq_affinity=1 is to match the perf of requester. Given that
>>> the characteristic of HMP system is that power has an equal importance to perf
>>> (I think this now has become true for all systems by the way), saying that the
>>> match in one direction is better than the other is sort of forcing a policy of
>>> perf first which I don't think is a good thing to enforce. We don't have enough
>>> info to decide at this level. And our users care about both.
>>
>> I would argue rq_affinity=1 matches the perf, so that flag should already bias
>> perf in favor of power slightly?
> 
> Not on this type of systems. If perf was the only thing important, just use
> equally big cpus. Balancing perf and power is important on those systems, and
> I don't think we have enough info to decide which decision is best when
> capacities are not the same. Matching the perf level the requesting on makes
> sense when irq_affinity=1.

Well you could still want a
"IO performance always beats power considerations" and still go HMP because
sometimes for non-IO you prefer power, but I agree that we don't have enough
information about what the user wants from the system/kernel.

> 
>> Although the actual effect on power probably isn't that significant, given
>> that the (e.g. big) CPU has submitted the IO, is woken up soon, so you could
>> almost ignore a potential idle wakeup and the actual CPU time spent in the block
>> completion is pretty short of course.
>>
>>> If no matching is required, it makes sense to set rq_affinity to 0. When
>>> matching is enabled, we need to rely on per-task iowait boost to help the
>>> requester to run at a bigger CPU, and naturally the completion will follow when
>>> rq_affinity=1. If the requester doesn't need the big perf, but the irq
>>> triggered on a bigger core, I struggle to understand why it is good for
>>> completion to run on bigger core without the requester also being on a similar
>>> bigger core to truly maximize perf.
>>
>> So first of all, per-task iowait boosting has nothing to do with it IMO.
> 
> It has. If the perf is not good because the requester is running on little
> core, the requester need to move up to ensure the overall IO perf is better.

See below but also
"the requester need to move up to ensure the overall IO perf is better" is
just not true, with asynchronous IO submission done right, the submission
runtime isn't critical to the IO throughput, therefore it should run the
most power-efficient way.
This can be observed e.g. with any io_uring fio workload with significant
iodepth (and possibly multi-threading).
Completion may be a different story, depending on the device stack, if we're
dealing with !MCQ then the completion path (irq + block layer completion)
is absolutely critical.
For any mmc / ufs<4.0 system the performance difference between
fio --name=little --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$LITTLE_CPUS
and
fio --name=big --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$BIG_CPUS
is (usually) only because of the completion path and setting irq affinity of
/dev/sda to $BIG_CPUS will make the difference disappear (rq_affinity=0 and
implying LLC is the same).
Running the submission on little CPUs will usually be the most power-efficient
way then.

> 
>> Plenty of IO workloads build up utilization perfectly fine.
> 
> These ones have no problems, no? They should migrate to big core and the
> completion will follow them when they move.

So if I understood Manish correctly the only reason they want the completion
to run on a bigger CPU than the submission is because the submission is already
saturating the CPU, therefore utilization of submission is no issue whatsoever.
They don't want to run (submission) on big though because of power
considerations.

> 
>> I wouldn't consider the setup: requester little perf, irq+completion big perf
>> invalid necessarily, it does decrease IO latency for the application.
> 
> I didn't say invalid. But it is not something we can guess automatically when
> irq_affinity=1. We don't have enough info to judge. The only info we have the
> requester that originated the request is running at different perf level
> (whther higher or lower), so we follow it.
>
Anyway, Manish's problem should be solved by rq_affinity=0 in that case (with
irq affinities set to big CPU then the completion will be run on the irq CPU)
and "rq_affinity=1 <=> equal capacity CPU" is the correct interpretation, is that
more or less agreed upon now?