Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"

On 08/21/24 17:59, MANISH PANDEY wrote:
> Hi all,
> 
> We agree with the point that rq_affinity can vary from target to target.
> 
> But please also consider the use cases below:
> 
> 1. Storage devices with a single hardware queue (like UFS 2.x/3.x):
> in an HMP system with a shared LLC, not all the CPUs will be active and online.

Is this mainline behavior? Hotplug is not considered a normal operation, and
a CPU not being active is not something I am aware of in mainline behavior
aside from hotplug.

I think you're referring to a platform-specific out-of-tree feature that is not
part of mainline Linux.

> Hence for a large amount (say ~1GB) of random IO transactions, if the
> requester CPU is from the smaller CPU group, then due to capacity-based
> grouping the large cluster CPUs would be mostly unused/idle, as the
> completion would also happen on a same-capacity CPU. Because of this, only
> the smaller/mid capacity CPUs would have to submit and complete the
> requests. We could instead have completed the requests on the large
> capacity CPUs and made better use of their compute power.

Sorry if I missed your reply elsewhere. But if you don't want the
rq_affinity=1 behavior and do your own custom routing, why isn't rq_affinity=0
the right solution for that custom routing?
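
For reference, rq_affinity is a per-device sysfs knob, so an integrator can
pick the behavior per device and per SoC. Below is a minimal userspace sketch
of setting it at boot (the device name "sda" and the choice of 0 are
placeholders, not taken from your setup): 0 leaves the completion wherever the
hardirq lands, 1 steers it into the submitter's cache (and now capacity)
domain, and 2 forces it onto the exact submitting CPU.

/*
 * Userspace sketch only: write the standard block-layer knob
 * /sys/block/<dev>/queue/rq_affinity. "sda" is a placeholder device.
 */
#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/block/sda/queue/rq_affinity";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* 0: don't steer completions; follow the hardirq CPU. */
	fprintf(f, "0\n");
	fclose(f);
	return 0;
}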

> 
> But to move IO requests from the S/M cluster CPUs to the L cluster, the
> scheduler would need to wait until a threshold of IO hits, and by that time
> the performance application would have spent some of its runtime and would
> report low performance in the overall results.

This is not an IO-specific problem. Tasks that need more performance have to
wait for their utilization to grow anyway. I don't see this as a problem
related to the rq_affinity=1 behavior but as a generic scheduler behavior
issue that is outside the block layer.

> 
> 
> On 08/02/24 10:03, Christian Loehle wrote:
> > So I'm assuming you're seeing something like the following:
> > Some CPU(s) (call them S) are submitting IO, hardirq triggers on
> > S.
> > Before the patch the completion softirq could run on a !S CPU,
> > now it runs on S. Am I then correct in assuming your workload
> > is CPU-bound on S? Would you share some details about the
> > workload, too?
> >
> > What's the capacity of CPU(s) S then?
> 
> 
> Yes, for a few SoCs we follow the same approach.
> Say an SoC with 3 clusters (S/M/L): if the M CPU(s) are submitting IO
> requests (a large data transfer), then since the LLC is shared we can allow
> the L CPU(s) to complete the requests via softirq/hardirq. This avoids
> contention between the submission and completion paths.

So in this case you want rq_affinity=0?

> 
> 
> Again consider an SoC with 2 clusters (4 CPUs of small capacity / 2 CPUs
> of mid capacity).
> In such a case, if the M CPUs are used for IO submission and completion,
> these CPUs would face a heavy workload and hence performance will be
> impacted. Without the new patch, we could have moved a few IO completions
> to the S cluster CPUs.

And on this different system you want rq_affinity=1?

I still don't see why you must use rq_affinity=1 for your custom routing,
which AFAICT is outside the supported realm. We have no clue whether this is
contention or not. It seems you have a lot of larger custom logic across many
layers, and I don't think what you're doing is part of what's supported.

Could you help contribute patches that detect contention problems and
differentiate between the two scenarios, instead of hacking around
rq_affinity=1?

> 
> > If no matching is required, it makes sense to set rq_affinity to 0.
> 
> > I don't get why rq_affinity=1 is compatible with this case? Isn't
> > this custom setup a fully managed system by you, meaning you want
> > rq_affinity=0? What do you lose if you move to rq_affinity=0?
> 
> Actually rq_affinity=0 would help for a few SoCs having MCQ, like UFS 4.x,
> but even this won't be a generic solution for us, as it doesn't consider an
> SoC which doesn't share the LLC and which would thus see a significant
> performance issue. Also, since the change has been picked up in all the
> kernel branches, it would be very difficult to experiment on older SoCs and
> find the best solution for each target.

This is a userspace knob. I don't think the expectation is to ship one
configuration for all SoCs...

> 
> This won't work for many SoCs, like the above example with 2 clusters, as
> rq_affinity=0 would then complete the request in hardirq context, which may
> not be suitable for all SoCs.
> Also, not all SoCs are ARM based and share an LLC.

Why can't you modify the value based on the SoC? The only thing I am gathering
from your arguments is that you can't modify rq_affinity to suit the setup the
SoC requires, but I don't get why.

> 
> 
> 
> 2. Storage devices with MCQ (like UFS 4.x / NVMe) use ESI/MSI interrupts,
> hence we have the opportunity to bind the ESI IRQ associated with a CQ.
> 
> 
> On 08/05/24 10:17, Bart Van Assche wrote:
> > On 8/4/24 7:07 PM, Qais Yousef wrote:
> > > irqbalancers usually move the interrupts, and I'm not sure we can
> > > make an assumption about the reason an interrupt is triggering on
> > > different capacity CPU.
> > User space software can't modify the affinity of managed interrupts.
> > From include/linux/irq.h:
> 
> > > > True. But this is a special case and was introduced for isolated
> > > > CPUs. I don't think drivers can request this themselves.
> 
> There are drivers which can manage the CPU affinity for IRQs
> using irq_set_affinity_hint() based on the use case.

Are these in-tree drivers? I'd love to learn how the use case is detected
generically, if this is in tree.
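
For context on that pattern, here is a minimal kernel-module-style sketch of a
driver hinting an IRQ's affinity with irq_set_affinity_hint(). The IRQ number
and CPU below are placeholders, not from any specific driver, and the call
only expresses a hint; the irqchip or irqbalance may still move the interrupt.

/*
 * Sketch: suggest that a completion-queue IRQ be serviced on a chosen CPU.
 * A real driver would use the IRQ it actually requested for its CQ.
 */
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/cpumask.h>

static unsigned int cq_irq = 42;	/* placeholder IRQ number */
static unsigned int target_cpu = 6;	/* e.g. a big-cluster CPU */

static int __init cq_affinity_hint_init(void)
{
	/* Hint that cq_irq should be handled on target_cpu. */
	return irq_set_affinity_hint(cq_irq, cpumask_of(target_cpu));
}

static void __exit cq_affinity_hint_exit(void)
{
	/* Drop the hint on unload. */
	irq_set_affinity_hint(cq_irq, NULL);
}

module_init(cq_affinity_hint_init);
module_exit(cq_affinity_hint_exit);
MODULE_LICENSE("GPL");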

> Since in these SoCs the LLC is shared between the M and L clusters, if IO
> is submitted from an M capacity CPU and the request is completed on an L
> capacity CPU, then even though the capacities of M and L differ, the
> completion can be done on the L cluster without the need for an IPI because
> the LLC is shared. With the new change, we may not get the advantage of the
> shared LLC.

How come? The LLC is shared between mid and large. It shouldn't make a
difference which of those CPUs handles the completion as far as taking
advantage of the LLC goes. What did I miss?

> 
> Proposed solution:
> We can have a backward compatible solution where a new control flag
> (QUEUE_FLAG_SAME_CAPACITY_FORCE) is provided to gate this new change and
> maintain backward compatibility.
> 
> How about introducing a new rq_affinity (maybe rq_affinity = 3) that only
> applies cpus_equal_capacity() when the new flag QUEUE_FLAG_SAME_CAPACITY is
> set?
> 
> 
> 	if (cpu == rq->mq_ctx->cpu ||
> 	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
> 	     cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> +	     (!test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags) ||
> +	      cpus_equal_capacity(cpu, rq->mq_ctx->cpu))))
> 		return false;
> 
> 
> 
> Could you please consider raising a similar change, if this seems fine to
> everyone.

This is a hack to work around the fact that you seem to not want to, or for
some reason I am unable to grasp yet cannot, change the rq_affinity setting to
suit the needs of the platform.

AFAICT we don't have any knowledge about contention that would let us
automatically decide the best way to route completion handling. It would be a
welcome addition if you have a solution that can teach the system to handle
this automatically. But in your case it seems you have a lot of custom setup
that is not part of the usual upstream handling, and you would just like to
keep backward compatibility for reasons I am unable to make sense of yet :(



