On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> Isolated CPUs are removed from queue mapping in this patchset, when someone
> submit IOs from the isolated CPU, what is the correct hctx used for handling
> these IOs?

No, every possible CPU gets a mapping. What this patch series does is
limit/align the number of hardware contexts to the number of
housekeeping CPUs. There is still a complete ctx-hctx mapping. So
whenever a user thread on an isolated CPU issues an IO, a housekeeping
CPU will also be involved (with the additional overhead, which seems to
be okay for these users).

Not having a hardware queue on the isolated CPUs ensures we really never
get any unexpected IO on those CPUs unless userspace issues it on its
own. It's a safety net.

Just to illustrate it, the non-isolcpus (default) configuration map for
an 8 CPU setup:

  queue mapping for /dev/vda
        hctx0: default 0
        hctx1: default 1
        hctx2: default 2
        hctx3: default 3
        hctx4: default 4
        hctx5: default 5
        hctx6: default 6
        hctx7: default 7

and with isolcpus=io_queue,2-3,6-7:

  queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

> From current implementation, it depends on implied zero filled
> tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
>
> During CPU offline, in blk_mq_hctx_notify_offline(),
> blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
> hctx 0 won't be drained.
>
> However managed irq core code still shutdowns the hw queue's irq because all
> CPUs in this hctx are offline now. Then IO hang is triggered, isn't
> it?

Thanks for the explanation. I was able to reproduce this scenario, that
is a hardware context with two CPUs which both go offline. Initially, I
used fio to create the workload, but this never hit the hang. Instead,
some background workload from systemd-journald turned out to be pretty
reliable at triggering the hang you describe.

Example: hctx2: default 4 6

CPU 0 stays online, CPUs 1-5 are offline. CPU 6 is offlined:

  smpboot: CPU 5 is now offline
  blk_mq_hctx_has_online_cpu:3537 hctx3 offline
  blk_mq_hctx_has_online_cpu:3537 hctx2 offline

and there is no forward progress anymore: the cpuhotplug state machine
is blocked and an IO is hanging:

  # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
  /sys/kernel/debug/block/vda/hctx2/tags:busy=61

and blk_mq_hctx_notify_offline() busy-loops forever:

  task:cpuhp/6  state:D  stack:0  pid:439  tgid:439  ppid:2  flags:0x00004000
  Call Trace:
   <TASK>
   __schedule+0x79d/0x15c0
   ? lockdep_hardirqs_on_prepare+0x152/0x210
   ? kvm_sched_clock_read+0xd/0x20
   ? local_clock_noinstr+0x28/0xb0
   ? local_clock+0x11/0x30
   ? lock_release+0x122/0x4a0
   schedule+0x3d/0xb0
   schedule_timeout+0x88/0xf0
   ? __pfx_process_timeout+0x10/0x10
   msleep+0x28/0x40
   blk_mq_hctx_notify_offline+0x1b5/0x200
   ? cpuhp_thread_fun+0x41/0x1f0
   cpuhp_invoke_callback+0x27e/0x780
   ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
   ? cpuhp_thread_fun+0x42/0x1f0
   cpuhp_thread_fun+0x178/0x1f0
   smpboot_thread_fn+0x12e/0x1c0
   ? __pfx_smpboot_thread_fn+0x10/0x10
   kthread+0xe8/0x110
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x33/0x40
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

I don't think this is a new problem introduced by this series; the
problem exists for any hardware context which has more than one CPU.
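
For reference, the wait in blk_mq_hctx_notify_offline() that the trace
above is stuck in looks roughly like this (paraphrased from blk-mq.c
from memory, so treat it as a sketch rather than an exact quote):

	/* try to drain all in-flight requests of this hctx */
	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
		/* no upper bound: spins forever if requests never complete */
		while (blk_mq_hctx_has_requests(hctx))
			msleep(5);
		percpu_ref_put(&hctx->queue->q_usage_counter);
	}

Since the loop has no bound, the hotplug thread never returns once the
in-flight requests stop making progress.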
As far as I understand it, the problem is that there is no forward
progress possible for the IO itself (I assume the corresponding
resources for the CPU going offline have already been shut down, thus
no progress?) and blk_mq_hctx_notify_offline() isn't doing anything in
this scenario. Couldn't we do something like:

+static bool blk_mq_hctx_timeout_rq(struct request *rq, void *data)
+{
+	blk_mq_rq_timed_out(rq);
+	return true;
+}
+
+static void blk_mq_hctx_timeout_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?
+			hctx->sched_tags : hctx->tags;
+	blk_mq_all_tag_iter(tags, blk_mq_hctx_timeout_rq, NULL);
+}
+
+
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
+	int i;

 	if (blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
@@ -3551,9 +3589,16 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	 * requests. If we could not grab a reference the queue has been
 	 * frozen and there are no requests.
 	 */
+	i = 0;
 	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
-		while (blk_mq_hctx_has_requests(hctx))
+		while (blk_mq_hctx_has_requests(hctx) && i++ < 10)
 			msleep(5);
+		if (blk_mq_hctx_has_requests(hctx)) {
+			pr_info("%s:%d hctx %d force timeout request\n",
+				__func__, __LINE__, hctx->queue_num);
+			blk_mq_hctx_timeout_rqs(hctx);
+		}
+

This guarantees forward progress, and it worked in my test scenario. I
got the corresponding log entry

  blk_mq_hctx_notify_offline:3598 hctx 2 force timeout request

and the hotplug state machine continued. I didn't see an IO error
either, but I haven't looked too closely; this is just a POC.

BTW, when looking at the tag allocator, I didn't see any hctx state
checks in the batched allocation path. Don't we need to check there too
whether the corresponding hardware context is still active, like the
non-batched path does (see the sketch at the end of this mail)?

@@ -486,6 +487,15 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 	if (data->nr_tags > 1) {
 		rq = __blk_mq_alloc_requests_batch(data);
 		if (rq) {
+			if (unlikely(test_bit(BLK_MQ_S_INACTIVE,
+					      &data->hctx->state))) {
+				blk_mq_put_tag(blk_mq_tags_from_data(data),
+					       rq->mq_ctx, rq->tag);
+				msleep(3);
+				goto retry;
+			}
 			blk_mq_rq_time_init(rq, alloc_time_ns);
 			return rq;
 		}

But given that this is the hotpath and the hotplug path is very
unlikely to be used at all, at least for the majority of users, I would
suggest trying to get blk_mq_hctx_notify_offline() to guarantee forward
progress instead. That would keep an additional 'if' out of the
hotpath.
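
For reference, the existing check on the single-request side that I am
referring to is, if I remember the code in blk-mq-tag.c correctly, at
the end of blk_mq_get_tag(), roughly (paraphrased, not an exact quote):

	/* give up the freshly allocated tag if the hctx went inactive */
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;
	}

__blk_mq_alloc_requests() then does the msleep(3)/retry dance, which is
what the batched hunk above tries to mimic.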