On Thu, 2019-01-10 at 21:40 +0000, Zak Hays wrote:
> Hello all,
>
> After upgrading to kernel version v4.17, I see hangs one out of every
> 200 boots or so. I then see the following hung tasks:
>
> INFO: task kblockd:30 blocked for more than 120 seconds.
> Tainted: P O 4.17.19-yocto-standard-edf324cbd3b997d05686954a2e8e5d27 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kblockd D 0 30 2 0x00000000
> Workqueue: kblockd blk_mq_run_work_fn
> [<c064382c>] (__schedule) from [<c0643a6c>] (schedule+0xa4/0xd0)
> [<c0643a6c>] (schedule) from [<c04ffac0>] (__mmc_claim_host+0x12c/0x238)
> [<c04ffac0>] (__mmc_claim_host) from [<c04ffc04>] (mmc_get_card+0x38/0x3c)
> [<c04ffc04>] (mmc_get_card) from [<c0513e44>] (mmc_mq_queue_rq+0x104/0x1fc)
> [<c0513e44>] (mmc_mq_queue_rq) from [<c02f8378>] (blk_mq_dispatch_rq_list+0x380/0x4b0)
> [<c02f8378>] (blk_mq_dispatch_rq_list) from [<c02fc2cc>] (blk_mq_do_dispatch_sched+0xf8/0x110)
> [<c02fc2cc>] (blk_mq_do_dispatch_sched) from [<c02fca38>] (blk_mq_sched_dispatch_requests+0x160/0x1d0)
> [<c02fca38>] (blk_mq_sched_dispatch_requests) from [<c02f63b4>] (__blk_mq_run_hw_queue+0x120/0x168)
> [<c02f63b4>] (__blk_mq_run_hw_queue) from [<c02f6434>] (blk_mq_run_work_fn+0x38/0x3c)
> [<c02f6434>] (blk_mq_run_work_fn) from [<c0047890>] (process_one_work+0x288/0x474)
> [<c0047890>] (process_one_work) from [<c0047ab4>] (process_scheduled_works+0x38/0x3c)
> [<c0047ab4>] (process_scheduled_works) from [<c00486a8>] (rescuer_thread+0x1f8/0x35c)
> [<c00486a8>] (rescuer_thread) from [<c004d948>] (kthread+0x158/0x174)
> [<c004d948>] (kthread) from [<c00090e4>] (ret_from_fork+0x14/0x30)
> Exception stack(0xe1dd3fb0 to 0xe1dd3ff8)
> 3fa0: 00000000 00000000 00000000 00000000
> 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> INFO: task kworker/1:1H:91 blocked for more than 120 seconds.
> Tainted: P O 4.17.19-yocto-standard-edf324cbd3b997d05686954a2e8e5d27 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/1:1H D 0 91 2 0x00000000
> Workqueue: kblockd blk_mq_run_work_fn
> [<c064382c>] (__schedule) from [<c0643a6c>] (schedule+0xa4/0xd0)
> [<c0643a6c>] (schedule) from [<c04ffac0>] (__mmc_claim_host+0x12c/0x238)
> [<c04ffac0>] (__mmc_claim_host) from [<c04ffc04>] (mmc_get_card+0x38/0x3c)
> [<c04ffc04>] (mmc_get_card) from [<c0513e44>] (mmc_mq_queue_rq+0x104/0x1fc)
> [<c0513e44>] (mmc_mq_queue_rq) from [<c02f8378>] (blk_mq_dispatch_rq_list+0x380/0x4b0)
> [<c02f8378>] (blk_mq_dispatch_rq_list) from [<c02fc2cc>] (blk_mq_do_dispatch_sched+0xf8/0x110)
> [<c02fc2cc>] (blk_mq_do_dispatch_sched) from [<c02fca38>] (blk_mq_sched_dispatch_requests+0x160/0x1d0)
> [<c02fca38>] (blk_mq_sched_dispatch_requests) from [<c02f63b4>] (__blk_mq_run_hw_queue+0x120/0x168)
> [<c02f63b4>] (__blk_mq_run_hw_queue) from [<c02f6434>] (blk_mq_run_work_fn+0x38/0x3c)
> [<c02f6434>] (blk_mq_run_work_fn) from [<c0047890>] (process_one_work+0x288/0x474)
> [<c0047890>] (process_one_work) from [<c0048abc>] (worker_thread+0x2b0/0x428)
> [<c0048abc>] (worker_thread) from [<c004d948>] (kthread+0x158/0x174)
> [<c004d948>] (kthread) from [<c00090e4>] (ret_from_fork+0x14/0x30)
> Exception stack(0xc19cffb0 to 0xc19cfff8)
> ffa0: 00000000 00000000 00000000 00000000
> ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> ffe0: 00000000 00000000 00000000 00000000 00000013 00000000
>
> After bisecting through the commits, I've found the hangs started
> after this commit:
>
> 81196976ed94 Adrian Hunter Wed Nov 29 15:41:03 2017 +0200 mmc: block: Add blk-mq support
>
> I'm not sure, however, what about this particular commit is the source
> of the problem.
>
> It appears that multiple tasks are trying to claim the host, but
> whatever task is responsible for releasing it isn't getting triggered.
> If I dump the blocked tasks, I don't see any other mmc-related tasks
> other than the two above.
>
> Has anyone run into this issue before? If not, does anyone have any
> ideas what might be causing the problem?
>
> Thanks,
> Zak Hays

In particular, our tracing shows that mmc_blk_mq_req_done is calling
kblockd_schedule_work, which should cause mmc_blk_mq_complete_work to
run, which in turn would do an mmc_put_card() and unblock the tasks in
mmc_get_card(). However, we do not see mmc_blk_mq_complete_work run,
possibly because kblockd and kworker/1:1H are themselves blocked in
mmc_get_card().
-- 
Steven Walter <steven.walter@xxxxxxxxxxx>
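
For reference, here is a condensed sketch of the completion path we expect
to run, paraphrased from our reading of drivers/mmc/core/block.c in this
kernel. The function and field names (mmc_blk_mq_req_done,
mmc_blk_mq_complete_work, mq->complete_work, mq->complete_req,
mmc_put_card) come from the trace and the driver, but the bodies are
heavily abbreviated: error handling, the recovery path, and the case where
the host can complete in the done context are omitted, so treat this as an
illustration of the chain described above rather than the actual kernel
code.

/*
 * Condensed sketch (not verbatim kernel code) of the expected
 * completion path in the mmc blk-mq conversion.
 */

/* The host driver calls this when the hardware finishes an mmc_request. */
static void mmc_blk_mq_req_done(struct mmc_request *mrq)
{
	struct mmc_queue_req *mqrq = container_of(mrq, struct mmc_queue_req,
						  brq.mrq);
	struct request *req = mmc_queue_req_to_req(mqrq);
	struct mmc_queue *mq = req->q->queuedata;

	/*
	 * The request cannot be finished in this context, so record it and
	 * punt the completion to the kblockd workqueue.
	 */
	mq->complete_req = req;
	kblockd_schedule_work(&mq->complete_work);
}

/*
 * Runs from kblockd: completes the recorded request and drops the host
 * claim, which is what allows tasks sleeping in __mmc_claim_host() (via
 * mmc_get_card()) to make progress.  In the hang above this work item
 * apparently never runs, so the claim is never released.
 */
static void mmc_blk_mq_complete_work(struct work_struct *work)
{
	struct mmc_queue *mq = container_of(work, struct mmc_queue,
					    complete_work);

	blk_mq_complete_request(mq->complete_req);	/* finish the I/O */
	mmc_put_card(mq->card, &mq->ctx);		/* release the host claim */
}

In other words, the work that is supposed to release the host claim is
queued on the same kblockd workqueue whose workers are shown above stuck
waiting for that claim in mmc_get_card(), which would explain why the
completion never runs.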