On Mon, Aug 12, 2013 at 4:45 PM, Grant Grundler <grundler@xxxxxxxxxxxx> wrote:
> I've been working on a "task mmcqd/0:84 blocked for more than 120
> seconds" panic for the past month or so in the chromeos-3.4 kernel
> tree. Stack trace below. Feel free to tell me "fixed in v3.x". :)

I've added this change:

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 1cf4900..a127ce1 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -135,6 +135,9 @@ void mmc_request_done(struct mmc_host *host, struct mmc_request *mrq)
 	struct mmc_command *cmd = mrq->cmd;
 	int err = cmd->error;
 
+WARN_ON(!host->claimed);
+WARN_ON(host->claimer != current);
+
 	if (err && cmd->retries && mmc_host_is_spi(host)) {
 		if (cmd->resp[0] & R1_SPI_ILLEGAL_COMMAND)
 			cmd->retries = 0;

and the "WARN_ON(host->claimer != current)" is triggering with this
stack trace (as I suspected):

WARNING: at /mnt/host/source/src/third_party/kernel/files/drivers/mmc/core/core.c:139 mmc_request_done+0x6c/0xf0()
Modules linked in: i2c_dev uinput asix usbnet nf_conntrack_ipv6 nf_defrag_ipv6 uvcvideo videobuf2_vmalloc sbs_battery mwifiex_sdio mwifiex cfg80211 btmrvl_sdio btmrvl rtc_s3c bluetooth zram(C) zsmalloc(C) fuse ip6table_filter xt_mark ip6_tables joydev
[<800150a4>] (unwind_backtrace+0x0/0x114) from [<804ee9a0>] (dump_stack+0x20/0x24)
[<804ee9a0>] (dump_stack+0x20/0x24) from [<8002bff8>] (warn_slowpath_null+0x44/0x5c)
[<8002bff8>] (warn_slowpath_null+0x44/0x5c) from [<803a3e74>] (mmc_request_done+0x6c/0xf0)
[<803a3e74>] (mmc_request_done+0x6c/0xf0) from [<803b9c4c>] (dw_mci_request_end+0xc4/0xec)
[<803b9c4c>] (dw_mci_request_end+0xc4/0xec) from [<803ba274>] (dw_mci_tasklet_func+0x354/0x3a8)
[<803ba274>] (dw_mci_tasklet_func+0x354/0x3a8) from [<80034044>] (tasklet_action+0xac/0x12c)
[<80034044>] (tasklet_action+0xac/0x12c) from [<800334f8>] (__do_softirq+0xc4/0x208)
[<800334f8>] (__do_softirq+0xc4/0x208) from [<80033a54>] (irq_exit+0x54/0x94)
[<80033a54>] (irq_exit+0x54/0x94) from [<8000ef68>] (handle_IRQ+0x8c/0xc8)
[<8000ef68>] (handle_IRQ+0x8c/0xc8) from [<800085ec>] (gic_handle_irq+0x4c/0x70)
[<800085ec>] (gic_handle_irq+0x4c/0x70) from [<8000e200>] (__irq_svc+0x40/0x60)

Is this expected behavior? It feels wrong to me since it means the
dw_mmc tasklet and whatever thread starts the IO can access the
mmc_request data structure concurrently. Do I have this right?

If I have this right, any "obvious" fixes? e.g. add locking to
mmc_start_request() and mmc_request_done()?

thanks,
grant

> After staring at the 14 MMC and DW driver data structures, I now
> think the dw_mmc driver is accessing MMC generic data structures
> (mmc_request and mmc_queue_req) without grabbing either
> mmc_blk_data->lock or mmc_queue->thread_sem, and it needs to. I don't
> have a specific stack trace yet where the dw_mmc driver is accessing
> MMC generic data without protection. This is where I need some
> guidance.
>
> I am confident the dw_mmc driver always acquires dw_mci->lock when
> accessing data in the dw_mci structure(s). I don't see any locking
> around access to the struct mmc_request via dw_mci_slot[]->mrq,
> though - not sure where that belongs.
>
> Two questions:
> 1) Is there interest in adding "assert_spin_locked()" calls to
> document locking dependencies?
> 2) Does anyone understand this code well enough to confirm I'm on the
> right track, and which code path should I be looking at?
>
> Back to the bug: mmc_start_req() is sleeping, waiting for the
> "previous" (in-flight) "async" IO to complete. Either (1) this IO
> never completes (unlikely), OR (2) it already completed (e.g.
> mmc_host->areq is stale), OR (3) mmc_host->areq is non-zero garbage.
> I'll add some code to rule out (3).
>
> I have confirmed that the stress test I'm running (many async IOs in
> flight, with two antagonist processes burning CPU cycles) sees about
> 4 completions per second that are already "done" before we call
> mmc_start_req(). So I know the race in (2) could happen.
>
> thanks,
> grant
>
>
> INFO: task mmcqd/0:84 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> mmcqd/0         D 804f4890     0    84      2 0x00000000
> [<804f4890>] (__schedule+0x614/0x780) from [<804f4dc0>] (schedule+0x94/0x98)
> [<804f4dc0>] (schedule+0x94/0x98) from [<804f2ae0>] (schedule_timeout+0x38/0x2d0)
> [<804f2ae0>] (schedule_timeout+0x38/0x2d0) from [<804f4c14>] (wait_for_common+0x164/0x1a0)
> [<804f4c14>] (wait_for_common+0x164/0x1a0) from [<804f4d28>] (wait_for_completion+0x20/0x24)
> [<804f4d28>] (wait_for_completion+0x20/0x24) from [<803a39f8>] (mmc_wait_for_req_done+0x2c/0x84)
> [<803a39f8>] (mmc_wait_for_req_done+0x2c/0x84) from [<803a4b50>] (mmc_start_req+0x60/0x120)
> [<803a4b50>] (mmc_start_req+0x60/0x120) from [<803b09bc>] (mmc_blk_issue_rw_rq+0xa0/0x3a8)
> [<803b09bc>] (mmc_blk_issue_rw_rq+0xa0/0x3a8) from [<803b10e8>] (mmc_blk_issue_rq+0x424/0x478)
> [<803b10e8>] (mmc_blk_issue_rq+0x424/0x478) from [<803b220c>] (mmc_queue_thread+0xb0/0x118)
> [<803b220c>] (mmc_queue_thread+0xb0/0x118) from [<8004d620>] (kthread+0xa8/0xbc)
> [<8004d620>] (kthread+0xa8/0xbc) from [<8000f1c8>] (kernel_thread_exit+0x0/0x8)
> Kernel panic - not syncing: hung_task: blocked tasks
> [<800150a4>] (unwind_backtrace+0x0/0x114) from [<804ee160>] (dump_stack+0x20/0x24)
> [<804ee160>] (dump_stack+0x20/0x24) from [<804ee2d0>] (panic+0xa8/0x1f4)
> [<804ee2d0>] (panic+0xa8/0x1f4) from [<80086d44>] (watchdog+0x1f4/0x25c)
> [<80086d44>] (watchdog+0x1f4/0x25c) from [<8004d620>] (kthread+0xa8/0xbc)
> [<8004d620>] (kthread+0xa8/0xbc) from [<8000f1c8>] (kernel_thread_exit+0x0/0x8)
--
To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html