Hi all,

I found a random high-CPU-utilization issue with a database benchmark program running on a 192-CPU virtual machine (VM). Originally the issue was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that it also reproduces with the latest upstream stable kernel v5.15.7, but *not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1 recently?

Details:

The database benchmark test starts 1000+ processes, but it is not CPU-intensive. The processes access 24 iSCSI-backed disks (multipath is used; no local disk is used). Normally the overall CPU utilization of the 192 CPUs should be 2~5%, including %user (less than 1%), %sys (2~5%) and %si (less than 1%). The iSCSI TCP traffic is not heavy and the packets-per-second (PPS) rate is not high. By default the "none" I/O scheduler is used in the test.

The issue is: every 1~2 minutes there is a >90% CPU utilization spike that lasts 5~20 seconds, and the system's performance drops significantly. While the spike is happening, *each* of the 192 CPUs' %sys goes up to 60+% and %si goes up to 30%! Sometimes, if I'm unlucky, the >90% CPU utilization happens all the time.

This is not a NIC driver issue, because the NIC interrupt rate is low due to NAPI, the PPS rate is also low (only about 250K/s), and the network traffic is low (typically only 2~5 Gbps; the limit for the VM is 30 Gbps, and the NIC driver itself can easily reach a throughput of 20+ Gbps when the database benchmark is not running). The NIC driver uses 31 queues by default. When the CPU utilization spike is happening, if I configure the NIC to use only 1 queue with "ethtool -L eth0 combined 1", the CPU utilization goes down to normal immediately, and the network throughput and PPS rate go up; if I use 2 queues, the CPU utilization is still abnormal (and it consistently stays high -- see [1]), though not as high as in the 31-queue case.
In the case of [1], the iSCSI TCP packets are processed on CPU 7 and CPU 30 ("cat /proc/irq/57/smp_affinity_list" returns "7" and "cat /proc/irq/58/smp_affinity_list" returns "30"), to which the NIC's MSI-X interrupts (IRQ 57 and 58) are bound. If I check the hottest functions on CPU 7 and CPU 30, e.g. with "perf top -C 7 --call-graph dwarf" (see [2]), the code path blk_done_softirq() -> ... -> blk_mq_sched_restart() -> ... -> kblockd_mod_delayed_work_on() -> ... -> try_to_grab_pending() is even hotter than the NIC TX/RX handler and the TCP/IP stack! Since the other 190 CPUs also show high %sys and %si (see [3]), I checked "perf top -C 0 --call-graph dwarf" (see [4]) and found that the same blk_done_softirq() -> ... -> try_to_grab_pending() code path is very hot there too.

My feeling is that the iSCSI hardware queues are going through the transition "not full -> full -> not full -> full -> ..." very frequently, so the blk-mq softirq handler is invoked very frequently while trying to submit new I/O requests. try_to_grab_pending() acquires and releases the pool->lock spinlock, so I guess there may be a lock contention issue?

If I also change the NIC's second IRQ's affinity from CPU 30 to CPU 7 (i.e. both NIC queues are handled by the same CPU 7) with "echo 7 > /proc/irq/58/smp_affinity_list", the high CPU utilization goes away immediately (i.e. %sys and %si on every CPU other than CPU 7 drop to normal immediately: %sys = 1.9%, %si = 0.6%); if I change the second IRQ's affinity back to CPU 30, the high CPU utilization returns immediately. Any insights into this significant difference when blk_done_softirq() runs on 1 CPU vs. on 2 CPUs? Note that when both NIC IRQs are handled by the same CPU, the network throughput and PPS rate are actually better.

When the high CPU utilization issue is happening, if I change the I/O scheduler from "none" to "mq-deadline", the issue basically also goes away immediately (%sys = 2.7%, %si = 1.4%; this is slightly higher than the case where "none" is used and both NIC IRQs are handled by the same CPU).

It's unclear why the issue doesn't reproduce with v5.16-rc1. The difference from v5.15.7 (the bad kernel) is huge (1+ million lines of changes), and a weird file system hang (with a git commit in between, I/O requests on the iSCSI disks cannot complete at all -- see [5]) prevents me from doing a git-bisect. There are about 100 patches related to blk-mq, and at first glance I have not found anything that explains why the issue does not reproduce with v5.16-rc1.

Thanks,
-- Dexuan

[1] A snippet of the "top" output (v5.15.7 stable kernel; the NIC uses 2 queues):

top - 01:49:14 up 31 min,  2 users,  load average: 1052.88, 999.40, 706.60
Tasks: 2862 total, 207 running, 2654 sleeping, 1 stopped, 0 zombie
%Cpu(s): 0.1 us, 35.2 sy, 0.0 ni, 1.1 id, 49.6 wa, 0.0 hi, 14.0 si, 0.0 st
MiB Mem : 2064232.+total, 2053872.+free, 9031.1 used, 1328.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 2049269.+avail Mem

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+    COMMAND
  195 root  20   0     0    0    0 R  92.7  0.0 13:37.47  ksoftirqd/30
   57 root  20   0     0    0    0 R  91.4  0.0 14:22.18  ksoftirqd/7
11649 root  20   0 18624 3544 1536 R  70.3  0.0  9:02.73  GenerateIO.Linu
11654 root  20   0 18624 3544 1536 R  69.8  0.0  9:05.26  GenerateIO.Linu
11650 root  20   0 18624 3544 1536 R  69.0  0.0  9:00.95  GenerateIO.Linu
11655 root  20   0 18624 3544 1536 R  68.9  0.0  8:39.59  GenerateIO.Linu
11652 root  20   0 18624 3544 1536 R  68.8  0.0  8:37.07  GenerateIO.Linu
11656 root  20   0 18624 3544 1536 R  68.5  0.0  8:40.68  GenerateIO.Linu
11651 root  20   0 18624 3544 1536 R  68.2  0.0  8:38.72  GenerateIO.Linu
11653 root  20   0 18624 3544 1536 S  67.5  0.0  8:39.97  GenerateIO.Linu
  869 root  20   0     0    0    0 S  10.2  0.0  1:32.12  ksoftirqd/142
  580 root  20   0     0    0    0 R  10.1  0.0  1:28.62  ksoftirqd/94
  629 root  20   0     0    0    0 S  10.1  0.0  1:28.80  ksoftirqd/102
  436 root  20   0     0    0    0 S  10.0  0.0  1:27.72  ksoftirqd/70
  556 root  20   0     0    0    0 S  10.0  0.0  1:29.44  ksoftirqd/90
  316 root  20   0     0    0    0 S   9.9  0.0  1:29.45  ksoftirqd/50
  592 root  20   0     0    0    0 S   9.9  0.0  1:32.97  ksoftirqd/96
  990 root  20   0     0    0    0 R   9.8  0.0  1:29.08  ksoftirqd/162
...
[2] "perf top -C 7 --call-graph dwarf -d 30" (only the first lines of the output are shown here) Samples: 1M of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 246356121644 lost: 0/47 drop: 0/0 Children Self Shared Object Symbol - 98.01% 0.03% [kernel] [k] __softirqentry_text_start - 12.10% __softirqentry_text_start - 7.10% blk_done_softirq - 7.22% blk_complete_reqs - 6.52% dm_softirq_done - 6.55% dm_done - 6.52% blk_mq_end_request - 6.53% blk_mq_free_request - 6.49% __blk_mq_free_request - 6.48% blk_mq_sched_restart - 6.46% blk_mq_run_hw_queue - 6.46% __blk_mq_delay_run_hw_queue - 7.20% kblockd_mod_delayed_work_on - 51.73% mod_delayed_work_on try_to_grab_pending - 4.89% net_rx_action - 4.90% __napi_poll - 3.28% mlx5e_napi_poll - 3.03% mlx5e_poll_rx_cq - 3.85% mlx5e_handle_rx_cqe - 2.65% napi_gro_receive - 2.54% dev_gro_receive - 1.78% napi_gro_complete.constprop.0.isra.0 - 1.80% netif_receive_skb_list_internal - 1.82% __netif_receive_skb_list_core - 1.76% ip_list_rcv - 1.77% ip_sublist_rcv - 1.67% ip_sublist_rcv_finish - 1.70% ip_local_deliver - 1.61% ip_local_deliver_finish - 1.71% ip_protocol_deliver_rcu - 1.81% tcp_v4_rcv - 1.63% tcp_v4_do_rcv - 1.73% tcp_rcv_established - 0.98% tcp_data_ready - 0.97% iscsi_sw_tcp_data_ready - 0.99% tcp_read_sock - 2.00% iscsi_sw_tcp_recv - 2.01% iscsi_tcp_recv_skb 0.99% memcpy_erms - 1.34% mlx5e_skb_from_cqe_linear - 0.86% mlx5e_poll_tx_cq - 1.93% napi_gro_flush - 1.95% napi_gro_complete.constprop.0.isra.0 - 1.95% netif_receive_skb_list_internal - 1.98% __netif_receive_skb_list_core - 1.92% ip_list_rcv - 1.93% ip_sublist_rcv - 1.84% ip_sublist_rcv_finish - 1.87% ip_local_deliver - 1.78% ip_local_deliver_finish - 1.89% ip_protocol_deliver_rcu - 1.97% tcp_v4_rcv - 1.82% tcp_v4_do_rcv - 1.93% tcp_rcv_established - 1.10% tcp_data_ready - 1.10% iscsi_sw_tcp_data_ready - 1.12% tcp_read_sock - 2.59% iscsi_sw_tcp_recv - 2.64% iscsi_tcp_recv_skb 0.78% iscsi_tcp_segment_done 1.38% memcpy_erms + 92.61% 0.01% [kernel] [k] 
smpboot_thread_fn + 92.58% 0.01% [kernel] [k] run_ksoftirqd + 56.55% 0.00% [kernel] [k] blk_done_softirq + 56.53% 0.02% [kernel] [k] blk_complete_reqs + 52.56% 0.00% [kernel] [k] __blk_mq_delay_run_hw_queue + 52.55% 0.01% [kernel] [k] blk_mq_run_hw_queue + 52.54% 0.00% [kernel] [k] kblockd_mod_delayed_work_on + 52.54% 0.87% [kernel] [k] mod_delayed_work_on + 52.10% 0.00% [kernel] [k] dm_softirq_done - 52.10% 0.01% [kernel] [k] dm_done - 6.54% dm_done - 6.52% blk_mq_end_request - 6.53% blk_mq_free_request - 6.49% __blk_mq_free_request - 6.48% blk_mq_sched_restart - 6.46% blk_mq_run_hw_queue - 6.46% __blk_mq_delay_run_hw_queue - 7.20% kblockd_mod_delayed_work_on - 51.73% mod_delayed_work_on try_to_grab_pending + 51.97% 0.06% [kernel] [k] blk_mq_free_request + 51.88% 0.00% [kernel] [k] blk_mq_end_request + 51.88% 0.04% [kernel] [k] __blk_mq_free_request + 51.77% 0.02% [kernel] [k] blk_mq_sched_restart - 51.67% 51.64% [kernel] [k] try_to_grab_pending - 6.34% ret_from_fork - kthread - 6.06% smpboot_thread_fn - 6.05% run_ksoftirqd - 48.66% __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - 1.53% worker_thread - 1.45% asm_sysvec_hyperv_stimer0 sysvec_hyperv_stimer0 irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending + 40.07% 0.02% [kernel] [k] net_rx_action + 40.04% 0.01% [kernel] [k] __napi_poll + 33.42% 0.06% [kernel] [k] netif_receive_skb_list_internal + 33.23% 0.05% [kernel] [k] __netif_receive_skb_list_core + 32.96% 0.04% [kernel] [k] ip_list_rcv + 32.87% 0.05% [kernel] [k] ip_sublist_rcv + 31.34% 0.05% [kernel] [k] ip_sublist_rcv_finish 
+ 31.27% 0.06% [kernel] [k] ip_local_deliver + 30.69% 0.02% [kernel] [k] ip_local_deliver_finish + 30.66% 0.04% [kernel] [k] ip_protocol_deliver_rcu + 30.57% 0.26% [kernel] [k] tcp_v4_rcv + 30.20% 0.03% [kernel] [k] napi_gro_complete.constprop.0.isra.0 + 29.79% 0.03% [kernel] [k] tcp_v4_do_rcv + 29.68% 0.15% [kernel] [k] tcp_rcv_established + 25.13% 0.04% [kernel] [k] tcp_data_ready + 25.05% 0.06% [kernel] [k] iscsi_sw_tcp_data_ready + 24.94% 0.05% [kernel] [k] tcp_read_sock + 22.24% 0.02% [kernel] [k] mlx5e_napi_poll + 21.04% 0.05% [kernel] [k] iscsi_sw_tcp_recv + 20.26% 0.30% [kernel] [k] mlx5e_poll_rx_cq + 19.93% 0.18% [kernel] [k] mlx5e_handle_rx_cqe + 17.35% 0.76% [kernel] [k] iscsi_tcp_recv_skb + 17.31% 0.07% [kernel] [k] napi_gro_receive + 16.60% 0.03% [kernel] [k] napi_gro_flush + 14.88% 0.32% [kernel] [k] dev_gro_receive + 12.23% 0.00% [kernel] [k] ret_from_fork + 12.23% 0.00% [kernel] [k] kthread + 10.17% 0.04% [kernel] [k] iscsi_complete_pdu + 10.06% 0.23% [kernel] [k] __iscsi_complete_pdu + 9.33% 0.03% [kernel] [k] iscsi_complete_task + 8.95% 0.52% [kernel] [k] __iscsi_put_task + 8.41% 0.03% [kernel] [k] blk_mq_complete_reque [3] The first lines of the "top" output are shown here: top - 02:09:58 up 52 min, 2 users, load average: 1050.01, 1038.20, 951.19 Tasks: 2862 total, 38 running, 2823 sleeping, 1 stopped, 0 zombie %Cpu0 : 1.1 us, 37.0 sy, 0.0 ni, 1.3 id, 48.6 wa, 0.0 hi, 12.0 si, 0.0 st %Cpu1 : 1.6 us, 26.1 sy, 0.0 ni, 7.9 id, 59.7 wa, 0.0 hi, 4.7 si, 0.0 st %Cpu2 : 3.7 us, 34.4 sy, 0.0 ni, 1.3 id, 51.0 wa, 0.0 hi, 9.6 si, 0.0 st %Cpu3 : 1.1 us, 24.6 sy, 0.0 ni, 8.0 id, 61.6 wa, 0.0 hi, 4.7 si, 0.0 st %Cpu4 : 0.8 us, 36.0 sy, 0.0 ni, 0.4 id, 53.7 wa, 0.0 hi, 9.1 si, 0.0 st %Cpu5 : 0.1 us, 25.9 sy, 0.0 ni, 5.6 id, 63.8 wa, 0.0 hi, 4.6 si, 0.0 st %Cpu6 : 0.7 us, 25.1 sy, 0.0 ni, 2.8 id, 67.3 wa, 0.0 hi, 4.0 si, 0.0 st %Cpu7 : 0.0 us, 2.1 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 97.8 si, 0.0 st %Cpu8 : 0.3 us, 38.2 sy, 0.0 ni, 0.1 id, 52.0 wa, 0.0 hi, 9.3 
si, 0.0 st %Cpu9 : 1.0 us, 28.3 sy, 0.0 ni, 1.4 id, 65.1 wa, 0.0 hi, 4.2 si, 0.0 st %Cpu10 : 0.9 us, 38.2 sy, 0.0 ni, 0.5 id, 49.7 wa, 0.0 hi, 10.6 si, 0.0 st %Cpu11 : 0.0 us, 27.0 sy, 0.0 ni, 2.3 id, 66.0 wa, 0.0 hi, 4.7 si, 0.0 st %Cpu12 : 0.1 us, 37.8 sy, 0.0 ni, 1.2 id, 51.5 wa, 0.0 hi, 9.4 si, 0.0 st %Cpu13 : 0.4 us, 27.3 sy, 0.0 ni, 2.9 id, 64.4 wa, 0.0 hi, 5.0 si, 0.0 st %Cpu14 : 0.3 us, 38.2 sy, 0.0 ni, 0.6 id, 52.3 wa, 0.0 hi, 8.6 si, 0.0 st %Cpu15 : 0.0 us, 26.4 sy, 0.0 ni, 2.9 id, 66.4 wa, 0.0 hi, 4.3 si, 0.0 st %Cpu16 : 0.0 us, 36.3 sy, 0.0 ni, 0.5 id, 54.6 wa, 0.0 hi, 8.6 si, 0.0 st %Cpu17 : 0.1 us, 26.8 sy, 0.0 ni, 5.0 id, 64.1 wa, 0.0 hi, 4.0 si, 0.0 st %Cpu18 : 0.4 us, 38.2 sy, 0.0 ni, 0.8 id, 52.6 wa, 0.0 hi, 8.0 si, 0.0 st %Cpu19 : 0.3 us, 24.5 sy, 0.0 ni, 6.0 id, 63.8 wa, 0.0 hi, 5.4 si, 0.0 st %Cpu20 : 0.9 us, 37.0 sy, 0.0 ni, 0.6 id, 53.1 wa, 0.0 hi, 8.4 si, 0.0 st [4] "perf top -C 0 --call-graph dwarf -d 30" (only the first lines of the output are shown here) Samples: 63K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 15949750000 lost: 77/77 drop: 0/0 Children Self Shared Object Symbol - 85.84% 0.01% [kernel] [k] __blk_mq_delay_run_hw_queue - 85.83% __blk_mq_delay_run_hw_queue - 69.75% __blk_mq_run_hw_queue - blk_mq_sched_dispatch_requests - 68.62% __blk_mq_sched_dispatch_requests - 42.70% blk_mq_dispatch_rq_list - 26.92% __blk_mq_delay_run_hw_queue - kblockd_mod_delayed_work_on - 26.91% mod_delayed_work_on - try_to_grab_pending - 4.03% asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu - __softirqentry_text_start - blk_done_softirq blk_complete_reqs dm_softirq_done - dm_done - 3.81% blk_mq_end_request - blk_mq_free_request - __blk_mq_free_request blk_mq_sched_restart - blk_mq_run_hw_queue - 3.81% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on try_to_grab_pending - 15.03% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on - 14.82% 
try_to_grab_pending - 2.41% asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu - __softirqentry_text_start - blk_done_softirq - blk_complete_reqs - dm_softirq_done dm_done blk_mq_end_request - blk_mq_free_request - 2.30% __blk_mq_free_request - blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on try_to_grab_pending - 0.72% dm_mq_queue_rq 0.60% map_request - 25.85% blk_mq_do_dispatch_ctx - 25.36% blk_mq_dispatch_rq_list - 13.48% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue - kblockd_mod_delayed_work_on - 13.48% mod_delayed_work_on - 13.27% try_to_grab_pending - 2.21% asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu - __softirqentry_text_start - blk_done_softirq - blk_complete_reqs - dm_softirq_done - 2.09% dm_done - 2.09% blk_mq_end_request - 2.09% blk_mq_free_request - __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on try_to_grab_pending - 11.07% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on - 11.02% try_to_grab_pending - 1.89% asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu - __softirqentry_text_start - blk_done_softirq blk_complete_reqs dm_softirq_done - dm_done - 1.81% blk_mq_end_request - blk_mq_free_request - 1.81% __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on try_to_grab_pending - 0.77% dm_mq_queue_rq 0.59% map_request - 1.12% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on 1.10% try_to_grab_pending - 16.09% kblockd_mod_delayed_work_on - mod_delayed_work_on try_to_grab_pending + 85.80% 0.01% [kernel] [k] blk_mq_run_hw_queue + 84.18% 0.00% [kernel] [k] kblockd_mod_delayed_work_on + 84.17% 0.95% [kernel] [k] mod_delayed_work_on - 83.36% 82.52% [kernel] [k] 
try_to_grab_pending - 54.58% pread64 (inlined entry_SYSCALL_64_after_hwframe do_syscall_64 - __x64_sys_pread64 - 54.57% vfs_read new_sync_read xfs_file_read_iter xfs_file_dio_read - iomap_dio_rw - 54.55% __iomap_dio_rw blk_finish_plug blk_flush_plug_list blk_mq_flush_plug_list - blk_mq_sched_insert_requests - 53.64% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue __blk_mq_run_hw_queue - blk_mq_sched_dispatch_requests - 52.67% __blk_mq_sched_dispatch_requests - 32.38% blk_mq_dispatch_rq_list - 19.96% __blk_mq_delay_run_hw_queue - kblockd_mod_delayed_work_on - 19.96% mod_delayed_work_on - try_to_grab_pending - 2.49% asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 12.37% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on - 12.36% try_to_grab_pending - asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 20.29% blk_mq_do_dispatch_ctx - 20.02% blk_mq_dispatch_rq_list - 10.92% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue - kblockd_mod_delayed_work_on - 10.91% mod_delayed_work_on - 10.90% try_to_grab_pending - asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on 
mod_delayed_work_on try_to_grab_pending - 9.03% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on - try_to_grab_pending - asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 0.97% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 0.91% blk_mq_try_issue_list_directly blk_mq_request_bypass_insert blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue __blk_mq_run_hw_queue - blk_mq_sched_dispatch_requests - 0.91% __blk_mq_sched_dispatch_requests - 0.83% blk_mq_dispatch_rq_list - 0.59% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 11.05% syscall entry_SYSCALL_64_after_hwframe do_syscall_64 - __x64_sys_io_submit - 5.77% blk_finish_plug blk_flush_plug_list blk_mq_flush_plug_list blk_mq_sched_insert_requests blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue __blk_mq_run_hw_queue - blk_mq_sched_dispatch_requests - 5.73% __blk_mq_sched_dispatch_requests - 3.73% blk_mq_dispatch_rq_list - 2.79% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on - try_to_grab_pending - 0.55% asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 0.93% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 2.00% blk_mq_do_dispatch_ctx - 1.99% blk_mq_dispatch_rq_list - 1.14% 
blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 0.85% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 5.28% io_submit_one __io_submit_one.constprop.0 aio_write xfs_file_write_iter - xfs_file_dio_write_aligned - 5.26% iomap_dio_rw - __iomap_dio_rw - 5.25% blk_finish_plug blk_flush_plug_list blk_mq_flush_plug_list - blk_mq_sched_insert_requests - 5.07% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue __blk_mq_run_hw_queue - blk_mq_sched_dispatch_requests - 5.01% __blk_mq_sched_dispatch_requests - 3.40% blk_mq_dispatch_rq_list - 2.63% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on - try_to_grab_pending - asm_sysvec_call_function_single sysvec_call_function_single irq_exit_rcu __softirqentry_text_start blk_done_softirq blk_complete_reqs dm_softirq_done dm_done blk_mq_end_request blk_mq_free_request __blk_mq_free_request blk_mq_sched_restart blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 0.76% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending - 1.62% blk_mq_do_dispatch_ctx - 1.57% blk_mq_dispatch_rq_list - 0.79% blk_mq_run_hw_queue __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on - mod_delayed_work_on 0.79% try_to_grab_pending - 0.77% __blk_mq_delay_run_hw_queue kblockd_mod_delayed_work_on mod_delayed_work_on try_to_grab_pending [5] the "dmesg" snippet for a kernel between v5.15 and v5.16-rc1 ( iSCSI I/O requests can not complete due to some bug. 
Is this a known issue?):

[ 45.991286] sd 21:0:0:1: [sdbp] Optimal transfer size 1048576 bytes
[ 45.991347] sd 25:0:0:2: [sdbr] Write Protect is off
[ 45.991349] sd 25:0:0:2: [sdbr] Mode Sense: 85 00 10 08
[ 45.991541] sd 23:0:0:1: [sdbq] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.991590] sd 22:0:0:2: Attached scsi generic sg71 type 0
[ 45.991746] sd 25:0:0:2: [sdbr] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.991877] sd 22:0:0:2: [sdbt] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.992077] sd 23:0:0:1: [sdbq] Optimal transfer size 1048576 bytes
[ 45.992093] sd 22:0:0:2: [sdbt] Write Protect is off
[ 45.992094] sd 22:0:0:2: [sdbt] Mode Sense: 85 00 10 08
[ 45.992165] sd 25:0:0:2: [sdbr] Optimal transfer size 1048576 bytes
[ 45.992394] sd 24:0:0:1: [sdbs] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 45.992426] sd 21:0:0:2: Attached scsi generic sg72 type 0
[ 45.992533] sd 22:0:0:2: [sdbt] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.992691] sd 24:0:0:1: [sdbs] Write Protect is off
[ 45.992693] sd 21:0:0:2: [sdbu] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.992694] sd 24:0:0:1: [sdbs] Mode Sense: 85 00 10 08
[ 45.992995] sd 22:0:0:2: [sdbt] Optimal transfer size 1048576 bytes
[ 45.993000] sd 21:0:0:2: [sdbu] Write Protect is off
[ 45.993002] sd 21:0:0:2: [sdbu] Mode Sense: 85 00 10 08
[ 45.993367] sd 24:0:0:1: [sdbs] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.993480] sd 23:0:0:2: Attached scsi generic sg73 type 0
[ 45.993603] sd 21:0:0:2: [sdbu] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.993993] sd 24:0:0:1: [sdbs] Optimal transfer size 1048576 bytes
[ 45.994283] sd 21:0:0:2: [sdbu] Optimal transfer size 1048576 bytes
[ 45.994359] sd 23:0:0:2: Power-on or device reset occurred
[ 45.995506] sd 24:0:0:2: Attached scsi generic sg74 type 0
[ 45.995754] sd 24:0:0:2: Power-on or device reset occurred
[ 45.996180] sd 23:0:0:2: [sdbv] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.996399] sd 23:0:0:2: [sdbv] Write Protect is off
[ 45.996402] sd 23:0:0:2: [sdbv] Mode Sense: 85 00 10 08
[ 45.996811] sd 23:0:0:2: [sdbv] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.997236] sd 23:0:0:2: [sdbv] Optimal transfer size 1048576 bytes
[ 45.998626] sd 24:0:0:2: [sdbw] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.998948] sd 24:0:0:2: [sdbw] Write Protect is off
[ 45.998949] sd 24:0:0:2: [sdbw] Mode Sense: 85 00 10 08
[ 45.999562] sd 24:0:0:2: [sdbw] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 46.000155] sd 24:0:0:2: [sdbw] Optimal transfer size 1048576 bytes
[ 46.115711] sd 25:0:0:2: [sdbr] Attached SCSI disk
[ 46.116112] sd 21:0:0:0: [sdbj] Attached SCSI disk
[ 46.116162] sd 22:0:0:0: [sdbk] Attached SCSI disk
[ 46.119829] sd 23:0:0:1: [sdbq] Attached SCSI disk
[ 46.120044] sd 21:0:0:2: [sdbu] Attached SCSI disk
[ 46.120146] sd 21:0:0:1: [sdbp] Attached SCSI disk
[ 46.120234] sd 22:0:0:1: [sdbo] Attached SCSI disk
[ 46.127808] sd 23:0:0:2: [sdbv] Attached SCSI disk
[ 46.127810] sd 25:0:0:1: [sdbn] Attached SCSI disk
[ 46.127941] sd 23:0:0:0: [sdbl] Attached SCSI disk
[ 46.128571] sd 24:0:0:1: [sdbs] Attached SCSI disk
[ 46.128584] sd 24:0:0:2: [sdbw] Attached SCSI disk
[ 46.128703] sd 24:0:0:0: [sdbm] Attached SCSI disk
[ 46.131549] sd 25:0:0:0: [sdbi] Attached SCSI disk
[ 46.132250] sd 22:0:0:2: [sdbt] Attached SCSI disk
[ 51.584107] device-mapper: multipath queue-length: version 0.2.0 loaded
[ 54.932029] loop6: detected capacity change from 0 to 8
[ 74.779368] hv_balloon: Max. dynamic memory size: 2097152 MB
[ 128.762895] sysrq: Show Blocked State
[ 128.827260] task:systemd-udevd state:D stack: 0 pid: 4125 ppid: 2402 flags:0x00000000
[ 128.827265] Call Trace:
[ 128.827269]  __schedule+0xce6/0x13d0
[ 128.827281]  ? blk_finish_plug+0x26/0x40
[ 128.827288]  ? read_pages+0x1db/0x280
[ 128.827295]  schedule+0x4e/0xb0
[ 128.827298]  io_schedule+0x3f/0x70
[ 128.827301]  wait_on_page_bit_common+0x16e/0x3b0
[ 128.827308]  ? filemap_invalidate_unlock_two+0x40/0x40
[ 128.827311]  put_and_wait_on_page_locked+0x4f/0x60
[ 128.827313]  filemap_get_pages+0x643/0x660
[ 128.827317]  filemap_read+0xbb/0x3e0
[ 128.827319]  ? refill_obj_stock+0xc8/0x130
[ 128.827324]  generic_file_read_iter+0xe5/0x150
[ 128.827327]  blkdev_read_iter+0x38/0x70
[ 128.827332]  new_sync_read+0x113/0x1a0
[ 128.827341]  vfs_read+0xfe/0x1a0
[ 128.827345]  ksys_read+0x67/0xe0
[ 128.827347]  __x64_sys_read+0x1a/0x20
[ 128.827350]  do_syscall_64+0x5c/0xc0
[ 128.827356]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 128.827359]  ? irqentry_exit+0x19/0x30
[ 128.827361]  ? exc_page_fault+0x89/0x160
[ 128.827362]  ? asm_exc_page_fault+0x8/0x30
[ 128.827367]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 128.827369] RIP: 0033:0x7fee5a814142
[ 128.827373] RSP: 002b:00007ffee6f253d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 128.827375] RAX: ffffffffffffffda RBX: 00005644e54956a8 RCX: 00007fee5a814142
[ 128.827376] RDX: 0000000000000400 RSI: 00005644e54956b8 RDI: 0000000000000006
[ 128.827378] RBP: 00005644e52edec0 R08: 00005644e5495690 R09: 00007fee5a8ef200
[ 128.827379] R10: 00005644e520c010 R11: 0000000000000246 R12: 0000000000000000
[ 128.827380] R13: 0000000000000400 R14: 00005644e5495690 R15: 00005644e52edf10
[ 128.828282] task:systemd-udevd state:D stack: 0 pid: 5166 ppid: 2402 flags:0x00000000
[ 128.828284] Call Trace:
[ 128.828286]  __schedule+0xce6/0x13d0
[ 128.828289]  ? blk_finish_plug+0x26/0x40
[ 128.828291]  ? read_pages+0x1db/0x280
[ 128.828293]  schedule+0x4e/0xb0
[ 128.828296]  io_schedule+0x3f/0x70
[ 128.828298]  wait_on_page_bit_common+0x16e/0x3b0
[ 128.828301]  ? filemap_invalidate_unlock_two+0x40/0x40
[ 128.828304]  put_and_wait_on_page_locked+0x4f/0x60
[ 128.828307]  filemap_get_pages+0x643/0x660
[ 128.828310]  ? __cond_resched+0x19/0x30
[ 128.828313]  filemap_read+0xbb/0x3e0
[ 128.828316]  ? sed_ioctl+0x7d1/0x8a0
[ 128.828320]  ? cp_new_stat+0x152/0x180
[ 128.828324]  generic_file_read_iter+0xe5/0x150
[ 128.828326]  ? event_triggers_call+0x5d/0xe0
[ 128.828332]  blkdev_read_iter+0x38/0x70
[ 128.828334]  new_sync_read+0x113/0x1a0
[ 128.828337]  vfs_read+0xfe/0x1a0
[ 128.828340]  ksys_read+0x67/0xe0
[ 128.828343]  __x64_sys_read+0x1a/0x20
[ 128.828345]  do_syscall_64+0x5c/0xc0
[ 128.828348]  ? irqentry_exit+0x19/0x30
[ 128.828350]  ? exc_page_fault+0x89/0x160
[ 128.828351]  ? asm_exc_page_fault+0x8/0x30
[ 128.828353]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 128.828355] RIP: 0033:0x7fee5a814142
[ 128.828357] RSP: 002b:00007ffee6f25318 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 128.828358] RAX: ffffffffffffffda RBX: 00005644e529ff68 RCX: 00007fee5a814142
[ 128.828360] RDX: 0000000000000040 RSI: 00005644e529ff78 RDI: 0000000000000006
[ 128.828360] RBP: 00005644e52dd960 R08: 00005644e529ff50 R09: 00005644e5459b90
[ 128.828361] R10: 00005644e520c010 R11: 0000000000000246 R12: 0000001fffff0000
[ 128.828362] R13: 0000000000000040 R14: 00005644e529ff50 R15: 00005644e52dd9b0
[ 128.828430] task:systemd-udevd state:D stack: 0 pid: 5178 ppid: 2402 flags:0x00000000
[ 128.828433] Call Trace:
[ 128.828434]  __schedule+0xce6/0x13d0
[ 128.828437]  ? blk_finish_plug+0x26/0x40
[ 128.828439]  ? read_pages+0x1db/0x280
[ 128.828441]  schedule+0x4e/0xb0
[ 128.828444]  io_schedule+0x3f/0x70
[ 128.828446]  wait_on_page_bit_common+0x16e/0x3b0
[ 128.828449]  ? filemap_invalidate_unlock_two+0x40/0x40
[ 128.828452]  put_and_wait_on_page_locked+0x4f/0x60
[ 128.828454]  filemap_get_pages+0x643/0x660
[ 128.828458]  filemap_read+0xbb/0x3e0
[ 128.828460]  ? page_counter_try_charge+0x34/0xb0
[ 128.828466]  ? page_counter_cancel+0x2c/0x70
[ 128.828468]  ? page_counter_uncharge+0x22/0x40
[ 128.828470]  generic_file_read_iter+0xe5/0x150
[ 128.828473]  blkdev_read_iter+0x38/0x70
[ 128.828475]  new_sync_read+0x113/0x1a0
[ 128.828478]  vfs_read+0xfe/0x1a0
[ 128.828480]  ksys_read+0x67/0xe0
[ 128.828483]  __x64_sys_read+0x1a/0x20
[ 128.828485]  do_syscall_64+0x5c/0xc0
[ 128.828488]  ? __audit_syscall_exit+0x233/0x290
[ 128.828495]  ? exit_to_user_mode_prepare+0x3d/0x1c0
[ 128.828503]  ? syscall_exit_to_user_mode+0x27/0x50
[ 128.828505]  ? do_syscall_64+0x69/0xc0
[ 128.828508]  ? do_syscall_64+0x69/0xc0
[ 128.828510]  ? do_syscall_64+0x69/0xc0
[ 128.828512]  ? syscall_exit_to_user_mode+0x27/0x50
[ 128.828514]  ? do_syscall_64+0x69/0xc0
[ 128.828516]  ? asm_exc_page_fault+0x8/0x30
[ 128.828519]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 128.828520] RIP: 0033:0x7fee5a814142
[ 128.828522] RSP: 002b:00007ffee6f25318 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 128.828523] RAX: ffffffffffffffda RBX: 00005644e5215ba8 RCX: 00007fee5a814142
[ 128.828524] RDX: 0000000000000040 RSI: 00005644e5215bb8 RDI: 0000000000000006
[ 128.828525] RBP: 00005644e52ab260 R08: 00005644e5215b90 R09: 00005644e52b81e0
[ 128.828526] R10: 00005644e520c010 R11: 0000000000000246 R12: 000004ffffff0000
[ 128.828527] R13: 0000000000000040 R14: 00005644e5215b90 R15: 00005644e52ab2b0