Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"

Dietmar Eggemann <dietmar.eggemann@xxxxxxx> · Sun, 11 Aug 2024 19:41:08 +0200

+ linux-kernel@xxxxxxxxxxxxxxx

On 08/08/2024 08:05, MANISH PANDEY wrote:
>  
> On 8/5/2024 11:22 PM, Bart Van Assche wrote:
>> On 8/5/24 10:35 AM, MANISH PANDEY wrote:

[...]

>> Please use an approach that is supported by the block layer. I don't
>> think that dynamically changing the IRQ affinity is compatible with the
>> block layer.
> 
> For UFS with MCQ, ESI IRQs are bounded at the time of initialization.
> so basically i would like to use High Performance cluster CPUs to
> migrate few completions from Mid clusters and take the advantage of high
> capacity CPUs. The new change takes away this opportunity from driver.
> So basically we should be able to use High Performance CPUs like below
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b55..a4a2500c4ef6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct
> request *rq)
>         if (cpu == rq->mq_ctx->cpu ||
>             (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>              cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
> +            arch_scale_cpu_capacity(cpu) >=     
> arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>                 return false;
> 
> This way driver can use best possible CPUs for it's use case.

So the issue for you with commit af550e4c9682 seems to be that those
completions don't happen on big CPUs (cpu_capacity = 1024) anymore,
since the condition in  blk_mq_complete_need_ipi() (1):

 if (!QUEUE_FLAG_SAME_FORCE && cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
     cpus_equal_capacity(cpu, rq->mq_ctx->cpu))

is no longer true if 'rq->mq_ctx->cpu != big CPU' so (1) returns true
and blk_mq_complete_request_remote() sends an ipi to 'rq->mq_ctx->cpu'.

I tried to simulate this with a 6 CPUs aarch64 QEMU tri-gear (3
different cpu_capacity values) system:

cat /sys/devices/system/cpu/online
0-5

# cat /sys/devices/system/cpu/cpu*/cpu_capacity
446
446
871
871
1024
1024

# grep -i virtio /proc/interrupts | while read a b; do grep -aH .
/proc/irq/${a%:}/smp_affinity; done
/proc/irq/15/smp_affinity:3f /* block device */
/proc/irq/16/smp_affinity:3f /* network device */

So you set the block device irq affine to the big CPUs (0x30).

# echo 30 > /proc/irq/15/smp_affinity

And with the patch, you send ipi's in blk_mq_complete_request_remote()
in case 'rq->mq_ctx->cpu=[0-4]' whereas w/o the patch or the change to:

 arch_scale_cpu_capacity(cpu) >=
                            arch_scale_cpu_capacity(rq->mq_ctx->cpu) (2)

you would complete the request locally (i.e. on CPU4/5):

gic_handle_irq() -> ... -> handle_irq_event() -> ... -> vm_interrupt()
-> ... -> virtblk_done() (callback) -> blk_mq_complete_request() ->
blk_mq_complete_request_remote(), rq->q->mq_ops->complete(rq)

The patch IMHO was introduced to avoid running local when 'local =
little CPU'. Since you use system knowledge and set IRQ affinity
explicitly to big CPU's to run local on them, maybe (2) is the way to
allow both?