On Mon, Oct 14, 2019 at 5:21 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
>
> For network stack, RPS, namely Receive Packet Steering, is used to
> distribute network protocol processing from hardware-interrupted CPU
> to specific CPUs and alleviating soft-irq load of the interrupted CPU.
>
> For block layer, soft-irq (for single queue device) or hard-irq
> (for multiple queue device) is used to handle IO completion, so
> RPS will be useful when the soft-irq load or the hard-irq load
> of a specific CPU is too high, or a specific CPU set is required
> to handle IO completion.
>
> Instead of setting the CPU set used for handling IO completion
> through sysfs or procfs, we can attach an eBPF program to the
> request-queue, provide some useful info (e.g., the CPU
> which submits the request) to the program, and let the program
> decides the proper CPU for IO completion handling.
>
> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
...
>
> +	rcu_read_lock();
> +	prog = rcu_dereference_protected(q->prog, 1);
> +	if (prog)
> +		bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
> +	rcu_read_unlock();
> +
>  	cpu = get_cpu();
> -	if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> -		shared = cpus_share_cache(cpu, ctx->cpu);
> +	if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
> +		ccpu = ctx->cpu;
> +		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> +			shared = cpus_share_cache(cpu, ctx->cpu);
> +	} else
> +		ccpu = bpf_ccpu;
>
> -	if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
> +	if (cpu != ccpu && !shared && cpu_online(ccpu)) {
>  		rq->csd.func = __blk_mq_complete_request_remote;
>  		rq->csd.info = rq;
>  		rq->csd.flags = 0;
> -		smp_call_function_single_async(ctx->cpu, &rq->csd);
> +		smp_call_function_single_async(ccpu, &rq->csd);

Interesting idea.
Not sure whether such programmability makes sense from the block layer
point of view.

From the bpf side, having a program with a NULL input context is a bit
odd. We never had such things in the past, so this patchset won't work
as-is.
Also, no input means that the program's choices are quite limited.
Other than round robin and random, I cannot come up with other cpu
selection ideas.

I suggest using a writable tracepoint here instead.
Take a look at trace_nbd_send_request.
The BPF prog can write into 'request'.
For your use case it would be able to write into the 'bpf_ccpu' local
variable.
If you keep it as a raw tracepoint and don't add the actual tracepoint
with TP_STRUCT__entry and TP_fast_assign, then it won't be ABI and you
can change it later or remove it altogether.
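
To make the shape of that suggestion concrete, here is a minimal
userspace sketch, not kernel code. The hook receives a context with a
read-only input (the submitting CPU) and a writable output slot, and the
caller falls back to the submitting CPU when the hook leaves the slot
unset or picks an offline CPU, mirroring the fallback in the quoted
patch. Every name below (struct ccpu_ctx, ccpu_hook_t,
pick_completion_cpu, steer_to_sibling, NR_CPUS_SKETCH) is hypothetical
and only for illustration; the cache-sharing check from the patch is
left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU count for this sketch; not the kernel's NR_CPUS. */
#define NR_CPUS_SKETCH 8

/* Hypothetical context handed to the hook (stands in for tracepoint args). */
struct ccpu_ctx {
	int submit_cpu;	/* input: CPU that submitted the request */
	int ccpu;	/* writable output: chosen completion CPU, -1 = no choice */
};

/* Hypothetical hook type; in the kernel this role would be a BPF prog. */
typedef void (*ccpu_hook_t)(struct ccpu_ctx *ctx);

static bool cpu_is_online(const bool *online, int cpu)
{
	return cpu >= 0 && cpu < NR_CPUS_SKETCH && online[cpu];
}

/*
 * Run the hook (if any) and validate what it wrote. An unset, out-of-range
 * or offline choice falls back to the submitting CPU.
 */
static int pick_completion_cpu(ccpu_hook_t hook, const bool *online,
			       int submit_cpu)
{
	struct ccpu_ctx ctx = { .submit_cpu = submit_cpu, .ccpu = -1 };

	if (hook)
		hook(&ctx);

	if (!cpu_is_online(online, ctx.ccpu))
		return submit_cpu;

	return ctx.ccpu;
}

/* Example policy that needs input: steer to the neighbouring CPU. */
static void steer_to_sibling(struct ccpu_ctx *ctx)
{
	ctx->ccpu = ctx->submit_cpu ^ 1;
}
```

The point of the sketch is only that a policy such as
steer_to_sibling() needs the submitting CPU as input, which a program
run with a NULL context cannot see.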