Hi,

This is an RFC, because for the last couple of days I have been hunting a
mystery bug, and even now I am not entirely sure that the following story is
not just the bug's trick and an attempt to cheat me. [Sorry, the whole story
and explanation are quite long.]

The bug reproduces quite often on our server (AMD Opteron(tm) 6386 SE, 64
CPUs, 8 NUMA nodes) running a 4.4.5 kernel. What I observe is a stalled IO
request in the MQ ctx->rq_list with the following backtrace:

[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

root@pserver8:~# cat /sys/module/null_blk/parameters/queue_mode
2      # MQ
root@pserver8:~# cat /sys/module/null_blk/parameters/submit_queues
1      # number of hw queues
root@pserver8:~# cat /sys/module/null_blk/parameters/irqmode
1      # softirq

Verification that nullb0 has something in flight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

GDB shows that the request is alive, nothing is corrupted, and all bitmaps
are valid:

root@pserver8:~# gdb /usr/src/linux-4.4.5-1-storage/vmlinux /proc/kcore
(gdb) p (struct request *)0xffff8838038e2400
$1 = (struct request *) 0xffff8838038e2400
(gdb) p *$1
...
  atomic_flags = 1,
  cpu = -1,
  __data_len = 98304,
  __sector = 192210432,
...
  bio = 0xffff8807ef64d000,
  biotail = 0xffff8807ef64df00,
  start_time = 4294963642,
...
(gdb) p $1->mq_ctx
$3 = (struct blk_mq_ctx *) 0xffffe8c7e0c9d300
(gdb) p $1->mq_ctx->cpu
$4 = 2
(gdb) p $1->q->queue_hw_ctx[$1->q->mq_map[2]]
$5 = (struct blk_mq_hw_ctx *) 0xffff8807f1967c00
(gdb) p $5->ctx_map
$6 = {
  size = 8,
  bits_per_word = 8,
  map = 0xffff8807f1bb6200
}
(gdb) p *$5->ctx_map.map@8
$18 = {{ word = 4, depth = 8 }, { word = 0, depth = 8 }, { word = 0, depth = 8 },
       { word = 0, depth = 8 }, { word = 0, depth = 8 }, { word = 0, depth = 8 },
       { word = 0, depth = 8 }, { word = 0, depth = 8 }}
(gdb) p &$5->ctxs[0 + 2]->rq_list
$23 = (struct list_head *) 0xffffe8c7e0c9d308
(gdb) p $5->ctxs[0 + 2]->rq_list
$24 = {
  next = 0xffff8838038e2400,
  prev = 0xffff8838038e2400      <<<<< Our pending request
}

During debugging it became clear that the stalled request is always inserted
into the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() eventually offloads the
execution of __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on(). That can mean that we race with another
CPU, which is about to execute the __blk_mq_run_hw_queue() work.

OK, then I decided to trace deeper to understand whether the race really
exists and, if yes, whether it is caused by a still-pending hctx->run_work.
For further tracing I used 4 new events:

  1. insertion of a request into a rq_list
  2. hctx marked as pending
  3. whether kblockd_schedule_delayed_work_on() succeeded or not
  4. actual execution of flush_busy_ctxs()

and inserted them as in the hunks below:

@@ -690,6 +690,8 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	struct blk_mq_ctx *ctx;
 	int i;
 
+	trace_hctx_flush_busy_ctxs(hctx);
+
 	for (i = 0; i < hctx->ctx_map.size; i++) {
 		struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];

@@ -872,8 +876,9 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		put_cpu();
 	}
 
-	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
-			&hctx->run_work, 0);
+	queued = kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+			&hctx->run_work, 0);
+	trace_hctx_delayed_work_queued(hctx, queued);

@@ -1047,8 +1052,10 @@ static void blk_mq_insert_requests(struct request_queue *q,
 		list_del_init(&rq->queuelist);
 		rq->mq_ctx = ctx;
 		__blk_mq_insert_req_list(hctx, ctx, rq, false);
+		trace_hctx_insert_req_list(hctx, rq);
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	trace_hctx_mark_pending(hctx);
 	spin_unlock(&ctx->lock);

I have to say that I could not use the kernel trace events framework (the one
in include/trace/events), because then the bug of course immediately
disappears. So I wrote my own static lockless ring buffer with a capacity of
64 events. I could not use any sort of atomics either - the bug disappears.
The ring buffer is implemented per CPU, so afterwards I had to sort all the
events using some sufficiently accurate global clock. But I could not use
sched_clock() - the bug disappears. I chose raw rdtsc(), and again the
unordered variant, not rdtsc_ordered(): rdtsc_ordered() once more "fixes" the
bug. So frankly I am not sure whether I can trust rdtsc(), which can be
reordered, but there was no other option.
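To make the tracer a bit more concrete, here is a minimal sketch of what such
a per-CPU, lockless, atomics-free ring buffer can look like. All names here
(trace_hctx_event(), the TR_* event types, TRACE_RING_SIZE) are invented for
illustration and are not the exact code used during this debugging session:

/*
 * Illustrative sketch only: a per-CPU, lock-free, atomics-free event log
 * timestamped with raw rdtsc().
 */
#include <linux/percpu.h>
#include <asm/msr.h>		/* rdtsc() */

#define TRACE_RING_SIZE	64

enum trace_type {
	TR_INS_REQ,		/* request inserted into ctx->rq_list        */
	TR_MARK_PEND,		/* ctx bit set in hctx->ctx_map              */
	TR_KBLK_SCHED,		/* kblockd_schedule_delayed_work_on() result */
	TR_FLUSH_BUSY,		/* flush_busy_ctxs() entered                 */
};

struct trace_event {
	unsigned long long tsc;		/* raw, unordered rdtsc() timestamp */
	enum trace_type	   type;
	void		  *hctx;
	unsigned long	   data;	/* rq pointer or the "queued" flag */
};

struct trace_ring {
	unsigned int	   head;	/* plain counter, no atomics */
	struct trace_event ev[TRACE_RING_SIZE];
};

static DEFINE_PER_CPU(struct trace_ring, trace_rings);

/* Called from the trace_hctx_*() hooks shown in the hunks above. */
static inline void trace_hctx_event(enum trace_type type, void *hctx,
				    unsigned long data)
{
	struct trace_ring *ring = this_cpu_ptr(&trace_rings);
	struct trace_event *ev = &ring->ev[ring->head++ % TRACE_RING_SIZE];

	ev->tsc	 = rdtsc();	/* deliberately not rdtsc_ordered() */
	ev->type = type;
	ev->hctx = hctx;
	ev->data = data;
}

/*
 * After the hang reproduces, the per-CPU rings are dumped and merge-sorted
 * by tsc to obtain the single global timeline shown below.
 */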
So here is the sorted output of the tracer for another stalled request:

rdtsc=849688091004, cpu= 6, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343600
rdtsc=849688091304, cpu= 6, MARK_PEND  hctx=ffff8800d46c0000
rdtsc=849688099595, cpu= 6, KBLK_SCHED hctx=ffff8800d46c0000, queued=1
rdtsc=849688106102, cpu= 6, FLUSH_BUSY hctx=ffff8800d46c0000
rdtsc=849688109586, cpu= 2, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343cc0
rdtsc=849688109957, cpu= 2, MARK_PEND  hctx=ffff8800d46c0000
rdtsc=849688110921, cpu= 2, KBLK_SCHED hctx=ffff8800d46c0000, queued=0

I always see the same picture in all the runs I did (of course the CPU
numbers and rq addresses can vary):

    CPU#6                                      CPU#2

    request ffff884000343600 inserted
    hctx marked as pending
    kblockd_schedule...() returns 1
    <schedule to kblockd workqueue>
    *** WORK_STRUCT_PENDING_BIT is cleared ***
    flush_busy_ctxs() is executed

                                               request ffff884000343cc0 inserted
                                               hctx marked as pending
                                               kblockd_schedule...() returns 0
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                               WTF?

As a result the request ffff884000343cc0 stays pending forever.

According to the trace output I see that the other CPU _always_ observes
WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it was
cleared on the first CPU.

Checking the workqueue.c code I see that clearing the bit is nothing more
than an atomic_long_set(), which is a <mov> instruction; this happens in the
following function:

   static inline void set_work_data()

In an attempt to "fix" the mystery I replaced the atomic_long_set() call with
atomic_long_xchg(), which is an <xchg> instruction. The problem has gone.

To me it looks as if test_and_set_bit() (a <lock btsl> instruction) does not
force a flush of all CPU caches, which can still be dirty after the <mov>
executed on another CPU, whereas <xchg> really updates the memory, so the
following <lock btsl> observes that the bit was cleared.

In conclusion I can say that I am lucky enough to be able to reproduce this
bug within several minutes on a specific load (I tried many other simple
loads using fio, and even btrecord/btreplay, without success). That easy
reproduction gives me some freedom to test and then to be sure that the
problem has really gone.

Signed-off-by: Roman Pen <roman.penyaev@xxxxxxxxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Jens Axboe <axboe@xxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 450c21f..60cd6db 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -621,7 +621,7 @@ static inline void set_work_data(struct work_struct *work, unsigned long data,
 			unsigned long flags)
 {
 	WARN_ON_ONCE(!work_pending(work));
-	atomic_long_set(&work->data, data | flags | work_static(work));
+	atomic_long_xchg(&work->data, data | flags | work_static(work));
 }
 
 static void set_work_pwq(struct work_struct *work, struct pool_workqueue *pwq,
-- 
2.7.4
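For reference, the two sides of the race described above look roughly like
this in kernel/workqueue.c as of 4.4. This is a heavily simplified paraphrase
(irq and locking details removed, everything unrelated to
WORK_STRUCT_PENDING_BIT stripped out), so treat it as a sketch, not the
literal kernel code:

/*
 * Queuing side: kblockd_schedule_delayed_work_on() ends up here.  If the
 * PENDING bit is already observed as set, the work is NOT queued and false
 * is returned -- this is the "queued=0" seen on CPU#2 in the trace above.
 */
bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
			   struct delayed_work *dwork, unsigned long delay)
{
	struct work_struct *work = &dwork->work;
	bool ret = false;

	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,	/* <lock btsl> */
			      work_data_bits(work))) {
		__queue_delayed_work(cpu, wq, dwork, delay);
		ret = true;
	}
	return ret;
}

/*
 * Worker side: process_one_work() clears PENDING before invoking the work
 * callback (which eventually calls flush_busy_ctxs()).  The clear itself
 * boils down to a plain store: set_work_data() -> atomic_long_set() ->
 * <mov>.  This is the clear that the other CPU apparently fails to observe.
 */
static void set_work_pool_and_clear_pending(struct work_struct *work,
					    int pool_id)
{
	smp_wmb();	/* only a write barrier here in v4.4 */
	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
}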