Re: [bug report] worker watchdog timeout in dispatch loop for null_blk

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 10 Mar 2022 18:00:56 +0800

On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> This issue does not look critical, but let me share it to ask for comments on a fix.
> 
> When an fio command with 40 jobs [1] is run for a null_blk device with memory
> backing and the mq-deadline scheduler, the kernel reports a BUG message [2]. The
> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps running for
> more than 30 seconds and other work cannot run. The 40 fio jobs keep
> creating many read requests to a single null_blk device, so every time
> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1, which
> means at least one request was dispatched. Hence, the while loop in
> blk_mq_do_dispatch_sched() does not break.
> 
> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
>         int ret;
> 
>         do {
>                 ret = __blk_mq_do_dispatch_sched(hctx);
>         } while (ret == 1);
> 
>         return ret;
> }
> 
> The BUG message was observed when I ran blktests block/005 with various
> conditions on a system with 40 CPUs. It was observed with kernel versions
> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> poll queue support"). This commit added the blk_mq_ops.map_queues callback. I
> guess it changed the dispatch behavior for null_blk devices and triggered the
> BUG message.

This is a blk-mq soft lockup issue on the dispatch side, and it shouldn't be
related to 0a593fbbc245.

If requests are queued faster than they are dispatched, the issue will be
triggered sooner or later, and it is especially easy to trigger on an SQ
device. I am sure it can be triggered on scsi_debug, and I have even seen such
a report on ahci.

> 
> I'm not so sure if we really need to fix this issue. It does not seem to be a
> real-world problem since it is observed only with null_blk. Real block devices
> have slower IO operations, so the dispatch should stop sooner when the hardware
> queue gets full. Also, 40 jobs for a single device is not a realistic workload.
> 
> Having said that, it does not feel right that other work items are held up
> during dispatch for null_blk devices. To avoid the BUG message, I can think of
> two fix approaches. The first one is to break the while loop in
> blk_mq_do_dispatch_sched using a loop counter [3] (or a jiffies timeout check).

This approach could work, but the queue needs to be re-run after the loop
breaks because the max dispatch count was reached. cond_resched() might be the
simplest way, but it can't be used here because of the rcu/srcu read lock.
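
To illustrate the "break, then re-run" idea, here is a minimal user-space
sketch, not kernel code. Everything named fake_* as well as bounded_dispatch()
and run_until_idle() is a hypothetical stand-in: fake_dispatch_once() mimics
the 0/1 return of __blk_mq_do_dispatch_sched(), and fake_request_rerun() stands
for whatever re-run mechanism the real fix would use (presumably something like
blk_mq_delay_run_hw_queue(), but that is an assumption).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a hardware queue context; not a kernel type. */
struct fake_hctx {
	unsigned int pending;     /* requests still waiting in the scheduler */
	unsigned int dispatched;  /* requests handed to the driver so far */
	bool rerun_requested;     /* set when the loop gave up early */
};

/* Mimics the 0/1 return: 1 if a request was dispatched, 0 if none was left. */
static int fake_dispatch_once(struct fake_hctx *h)
{
	if (h->pending == 0)
		return 0;
	h->pending--;
	h->dispatched++;
	return 1;
}

/* Stand-in for asking the worker to run this queue again later. */
static void fake_request_rerun(struct fake_hctx *h)
{
	h->rerun_requested = true;
}

/*
 * Dispatch at most 'budget' requests in one pass. If the budget is used up
 * while the last call still made progress, request a re-run so the remaining
 * requests are not stranded.
 */
static int bounded_dispatch(struct fake_hctx *h, unsigned int budget)
{
	unsigned int n = 0;
	int ret;

	assert(budget > 0);
	do {
		ret = fake_dispatch_once(h);
	} while (ret == 1 && ++n < budget);

	if (ret == 1)
		fake_request_rerun(h);
	return ret;
}

/* Models the worker: repeat passes until no re-run is requested. */
static unsigned int run_until_idle(struct fake_hctx *h, unsigned int budget)
{
	unsigned int passes = 0;

	do {
		h->rerun_requested = false;
		bounded_dispatch(h, budget);
		passes++;
	} while (h->rerun_requested);
	return passes;
}
```

With 10 pending requests and a budget of 4, the first pass dispatches 4
requests and asks for a re-run, and run_until_idle() needs 3 passes in total.
Between passes the worker is free to run other work items, which is the point
of breaking out of the loop. If the budget is exhausted exactly when the
scheduler becomes empty, one spurious re-run is requested, and that pass simply
finds nothing to dispatch.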

Thanks,
Ming