On 04/26/2016 04:15 AM, Roman Pen wrote: > The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list > with the following backtrace: > > [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. > [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6 > [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000 > [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] > [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000 > [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0 > [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38 > [ 601.350965] Call Trace: > [ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 > [ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80 > [ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230 > [ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220 > [ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0 > [ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 > [ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110 > [ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10 > [ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70 > [ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90 > [ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0 > [ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40 > [ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140 > [ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30 > [ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70 > [ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50 > [ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0 > [ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100 > [ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0 > [ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0 > [ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50 > > The underlying device is a null_blk, with default parameters: > > queue_mode = MQ > submit_queues = 1 > > Verification that nullb0 has something inflight: > > root@pserver8:~# cat /sys/block/nullb0/inflight > 0 1 > root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \; > ... > /sys/block/nullb0/mq/0/cpu2/rq_list > CTX pending: > ffff8838038e2400 > ... > > During debug it became clear that stalled request is always inserted in > the rq_list from the following path: > > save_stack_trace_tsk + 34 > blk_mq_insert_requests + 231 > blk_mq_flush_plug_list + 281 > blk_flush_plug_list + 199 > wait_on_page_bit + 192 > __filemap_fdatawait_range + 228 > filemap_fdatawait_range + 20 > filemap_write_and_wait_range + 63 > blkdev_fsync + 27 > vfs_fsync_range + 73 > blkdev_write_iter + 202 > __vfs_write + 170 > vfs_write + 169 > kernel_write + 56 > > So blk_flush_plug_list() was called with from_schedule == true. > > If from_schedule is true, that means that finally blk_mq_insert_requests() > offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue, > i.e. it calls kblockd_schedule_delayed_work_on(). > > That means, that we race with another CPU, which is about to execute > __blk_mq_run_hw_queue() work. > > Further debugging shows the following traces from different CPUs: > > CPU#0 CPU#1 > ---------------------------------- ------------------------------- > reqeust A inserted > STORE hctx->ctx_map[0] bit marked > kblockd_schedule...() returns 1 > <schedule to kblockd workqueue> > request B inserted > STORE hctx->ctx_map[1] bit marked > kblockd_schedule...() returns 0 > *** WORK PENDING bit is cleared *** > flush_busy_ctxs() is executed, but > bit 1, set by CPU#1, is not observed > > As a result request B pended forever. > > This behaviour can be explained by speculative LOAD of hctx->ctx_map on > CPU#0, which is reordered with clear of PENDING bit and executed _before_ > actual STORE of bit 1 on CPU#1. > > The proper fix is an explicit full barrier <mfence>, which guarantees > that clear of PENDING bit is to be executed before all possible > speculative LOADS or STORES inside actual work function. > > Signed-off-by: Roman Pen <roman.penyaev@xxxxxxxxxxxxxxxx> > Cc: Gioh Kim <gi-oh.kim@xxxxxxxxxxxxxxxx> > Cc: Michael Wang <yun.wang@xxxxxxxxxxxxxxxx> > Cc: Tejun Heo <tj@xxxxxxxxxx> > Cc: Jens Axboe <axboe@xxxxxxxxx> > Cc: linux-block@xxxxxxxxxxxxxxx > Cc: linux-kernel@xxxxxxxxxxxxxxx > Cc: stable@xxxxxxxxxxxxxxx > --- > kernel/workqueue.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 450c21f..0ec0594 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -649,6 +649,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work, > */ > smp_wmb(); > set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0); > + /* > + * The following mb guarantees that previous clear of a PENDING bit > + * will not be reordered with any speculative LOADS or STORES from > + * work->current_func, which is executed afterwards. Just to be clear: reordering with STORES in the work function is not a problem because no conclusion can be reached regarding the scheduling state of the work function from outside the work function, other than by testing PENDING. IOW, is_running = 1 clear PENDING is equivalent to clear PENDING is_running = 1 for any other logic like if (is_running) { } or if (!is_running) { } It's only LOADS in the work function that must follow STORES to PENDING. That the full barrier will also order stores is unintended effect. > This possible > + * reordering can lead to a missed execution on attempt to qeueue > + * the same @work. E.g. consider this case: > + * > + * CPU#0 CPU#1 > + * ---------------------------- -------------------------------- > + * > + * 1 STORE event_indicated > + * 2 queue_work_on() { > + * 3 test_and_set_bit(PENDING) > + * 4 } set_..._and_clear_pending() { > + * 5 set_work_data() # clear bit > + * 6 smp_mb() > + * 7 work->current_func() { > + * 8 LOAD event_indicated > + * } > + * > + * Without an explicit full barrier speculative LOAD on line 8 can > + * be executed before CPU#0 does STORE on line 1. If that happens, > + * CPU#0 observes the PENDING bit is still set and new execution of > + * a @work is not queued in a hope, that CPU#1 will eventually > + * finish the queued @work. Meanwhile CPU#1 does not see > + * event_indicated is set, because speculative LOAD was executed > + * before actual STORE. > + */ > + smp_mb(); > } > > static void clear_work_data(struct work_struct *work) > -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html