On 08/13/24 17:20, Christian Loehle wrote:
> On 8/9/24 01:23, Qais Yousef wrote:
> > On 08/05/24 11:18, Christian Loehle wrote:
> >
> >>> My understanding of rq_affinity=1 is to match the perf of requester. Given that
> >>> the characteristic of HMP system is that power has an equal importance to perf
> >>> (I think this now has become true for all systems by the way), saying that the
> >>> match in one direction is better than the other is sort of forcing a policy of
> >>> perf first which I don't think is a good thing to enforce. We don't have enough
> >>> info to decide at this level. And our users care about both.
> >>
> >> I would argue rq_affinity=1 matches the perf, so that flag should already bias
> >> perf in favor of power slightly?
> >
> > Not on this type of system. If perf was the only thing important, just use
> > equally big cpus. Balancing perf and power is important on those systems, and
> > I don't think we have enough info to decide which decision is best when
> > capacities are not the same. Matching the perf level the requester is running
> > on makes sense when rq_affinity=1.
> Well you could still want an
> "IO performance always beats power considerations" and still go HMP because
> sometimes for non-IO you prefer power, but I agree that we don't have enough
> information about what the user wants from the system/kernel.

The flag was to keep requester and completion on the same LLC. Note the keyword
same. Assume L3 doesn't exist and L2 is the LLC, which means capacity level and
LLC both map to the cluster in the same way. Then note that the capability of
each LLC is different, since the big clusters tend to have a larger L2. (A quick
way to eyeball the capacity/LLC layout and the rq_affinity knob is sketched at
the bottom of this mail.)

I think modern server SoCs can end up with a complex LLC topology where not all
LLCs have the same performance.

Keeping requester/completion on the same capacity keeps the old behavior. If we
want to teach it to do more, then we need sensible extensions based on sensible
use cases, and not hack rq_affinity=1 into doing complex things.

>
> >
> >> Although the actual effect on power probably isn't that significant, given
> >> that the (e.g. big) CPU has submitted the IO, is woken up soon, so you could
> >> almost ignore a potential idle wakeup and the actual CPU time spent in the block
> >> completion is pretty short of course.
> >>
> >>> If no matching is required, it makes sense to set rq_affinity to 0. When
> >>> matching is enabled, we need to rely on per-task iowait boost to help the
> >>> requester to run on a bigger CPU, and naturally the completion will follow when
> >>> rq_affinity=1. If the requester doesn't need the big perf, but the irq
> >>> triggered on a bigger core, I struggle to understand why it is good for the
> >>> completion to run on a bigger core without the requester also being on a similar
> >>> bigger core to truly maximize perf.
> >>
> >> So first of all, per-task iowait boosting has nothing to do with it IMO.
> >
> > It has. If the perf is not good because the requester is running on a little
> > core, the requester needs to move up to ensure the overall IO perf is better.
> See below but also
> "the requester needs to move up to ensure the overall IO perf is better" is
> just not true, with asynchronous IO submission done right, the submission

I think it is true for rq_affinity=1, which is the context of the discussion
and the patch.

> runtime isn't critical to the IO throughput, therefore it should run in the
> most power-efficient way.
> This can be observed e.g. with any io_uring fio workload with significant
> iodepth (and possibly multi-threading).
> Completion may be a different story, depending on the device stack: if we're
> dealing with !MCQ then the completion path (irq + block layer completion)
> is absolutely critical.
> For any mmc / ufs<4.0 system the performance difference between
> fio --name=little --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$LITTLE_CPUS
> and
> fio --name=big --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$BIG_CPUS
> is (usually) only because of the completion path, and setting the irq affinity
> of /dev/sda to $BIG_CPUS will make the difference disappear (with rq_affinity=0
> and implying the LLC is the same).
> Running the submission on little CPUs will usually be the most power-efficient
> way then.
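
To make the capacity/LLC point above concrete, here is a rough sketch of how to
eyeball the layout and the knob from userspace. It assumes an arm64-style system
that exposes cpu_capacity and cacheinfo in sysfs, $DEV is a placeholder for the
block device, and the highest cache index is taken to be the LLC:

    # Relative capacity of each CPU and which CPUs share its last level cache.
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
            printf '%s: capacity=%s llc_siblings=%s\n' "${cpu##*/}" \
                    "$(cat "$cpu/cpu_capacity")" \
                    "$(cat "$cpu"/cache/index*/shared_cpu_list | tail -1)"
    done

    # Completion placement policy for the device: 0 = complete wherever the irq
    # landed, 1 = steer back to the submitter's group, 2 = force the submitting CPU.
    cat /sys/block/$DEV/queue/rq_affinity

On a classic HMP SoC with no L3, the CPUs reporting the same capacity should be
exactly the ones sharing an LLC, which is the "same" the flag was about.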
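
And, if I understand the experiment right, something along these lines should
reproduce the "irq on big, submit on little" case. $IRQ (the storage controller's
irq from /proc/interrupts), $BIG_CPUS and $LITTLE_CPUS are placeholders as above,
and the paths assume the device really is sda:

    # Don't steer completions back to the submitter; leave them where the irq fires.
    echo 0 > /sys/block/sda/queue/rq_affinity

    # Take the device irq (and hence the completion work) on the big CPUs.
    echo $BIG_CPUS > /proc/irq/$IRQ/smp_affinity_list

    # Submit from the little CPUs; per the above, throughput should now be close
    # to the big-CPU submission run, with the cheaper CPUs doing the submission.
    fio --name=little --filename=/dev/sda --runtime=10 --rw=randread --bs=4k \
        --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting \
        --cpus_allowed=$LITTLE_CPUS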