Hi, On Wed, Mar 29, 2017 at 6:17 PM, Brian Norris <briannorris at chromium.org> wrote: > Hi all, > > I haven't managed to get as far as a bugfix for this, but I've bisected > some issues seen on v4.10+ with a Chromebook of the Veyron family (Jaq, > in particular). v4.9 works fine. OK, I finally got everything up and running to test this too... > Issue #1 - eMMC complains periodically: > > [ 4.358135] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 4.461466] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 5.291450] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 5.381471] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 11.243337] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 17.371628] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) I don't believe that this is an error, actually. Really we just need to quiet this message or (since I've always found it useful) move it to a different place. I believe that with runtime PM we're effectively turning the clock off whenever the MMC device isn't in use. On dw_mmc we have a "helpful" printout every time the clock is changed, and that's what you're seeing here. You can see with: while true; do dd if=/dev/mmcblk2 of=/dev/null bs=512 count=1 iflag=direct; sleep .1; done ...that you'll get a printout every 100ms. Ah, looks like this is in: ce69e2fea093 mmc: dw_mmc: silent verbose log when calling from PM context ...as pointed out by Shawn Lin. So I think in 4.10 we can just ignore those messages and they're good on 4.11. > and if I stress it out at all (e.g., dd if=/dev/mmcblk2 bs=1M > > /dev/null), it will eventually croak: > > [ 359.916315] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 360.071378] dwmmc_rockchip ff0f0000.dwmmc: Successfully tuned phase to 153 > [ 360.211351] mmcblk2: error -110 transferring data, sector 8644608, nr 2048, cmd response 0x900, card status 0x0 > [ 360.221936] mmcblk2: retrying using single block read > [ 363.491362] mmcblk2: error -110 transferring data, sector 8646656, nr 2048, cmd response 0x900, card status 0x0 > [ 363.531569] mmc_host mmc2: Bus speed (slot 0) = 400000Hz (slot req 400000Hz, actual 400000HZ div = 0) > [ 363.596326] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 363.612712] dwmmc_rockchip ff0f0000.dwmmc: Successfully tuned phase to 152 > [ 363.751351] mmcblk2: error -110 transferring data, sector 8646656, nr 2048, cmd response 0x900, card status 0x0 > [ 363.761938] mmcblk2: retrying using single block read > [ 366.611356] INFO: task mmcqd/2boot1:92 blocked for more than 120 seconds. > [ 366.618134] Not tainted 4.10.0 #284 > [ 366.622146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 366.629960] mmcqd/2boot1 D 0 92 2 0x00000000 > [ 366.635454] [<c07dc21c>] (__schedule) from [<c07dc4e0>] (schedule+0x90/0xa0) > [ 366.642497] [<c07dc4e0>] (schedule) from [<c066e8b4>] (__mmc_claim_host+0xd4/0x19c) > [ 366.650142] [<c066e8b4>] (__mmc_claim_host) from [<c066e9ac>] (mmc_get_card+0x30/0x34) > [ 366.658056] [<c066e9ac>] (mmc_get_card) from [<c067fc8c>] (mmc_blk_issue_rq+0x64/0x48c) > [ 366.666052] [<c067fc8c>] (mmc_blk_issue_rq) from [<c0680230>] (mmc_queue_thread+0x114/0x1b4) > [ 366.674484] [<c0680230>] (mmc_queue_thread) from [<c023d1b0>] (kthread+0x128/0x144) > [ 366.682134] [<c023d1b0>] (kthread) from [<c02076e8>] (ret_from_fork+0x14/0x2c) I'm not convinced this is a regression. I remember Heiko saying that he's heard reports that on some boards eMMC doesn't work with high speed, and I'be believe that's what you're seeing here. It would be interesting to try to debug this. I can't personally reproduce, though. I think veyron_minnie already has UHS turned off for eMMC upstream. I guess we could do it for other veyron boards too until someone can debug? > Issue #2 - Wifi (via SDIO, mmc1) is completely dead: > > [ 1.444125] mmc_host mmc1: card is non-removable. > [ 1.471368] mmc_host mmc1: Bus speed (slot 0) = 400000Hz (slot req 400000Hz, actual 400000HZ div = 0) > [ 1.619553] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 1.881699] mmc1: new ultra high speed SDR104 SDIO card at address 0001 > [ 25.681172] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 25.691666] mwifiex: rx work enabled, cpus 4 > [ 26.827000] mwifiex_sdio mmc1:0001:1: info: FW download over, size 800344 bytes > [ 27.561352] mwifiex_sdio mmc1:0001:1: WLAN FW is active > [ 33.585165] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 37.651344] mwifiex_sdio mmc1:0001:1: mwifiex_cmd_timeout_func: Timeout cmd id = 0xa9, act = 0x0 > [ 37.660122] mwifiex_sdio mmc1:0001:1: num_data_h2c_failure = 0 > [ 37.665951] mwifiex_sdio mmc1:0001:1: num_cmd_h2c_failure = 0 > [ 37.671688] mwifiex_sdio mmc1:0001:1: is_cmd_timedout = 1 > [ 37.677076] mwifiex_sdio mmc1:0001:1: num_tx_timeout = 0 > [ 37.682380] mwifiex_sdio mmc1:0001:1: last_cmd_index = 1 > [ 37.687681] mwifiex_sdio mmc1:0001:1: last_cmd_id: 00 00 a9 00 00 00 00 00 00 00 > [ 37.695066] mwifiex_sdio mmc1:0001:1: last_cmd_act: 00 00 00 00 00 00 00 00 00 00 > [ 37.702536] mwifiex_sdio mmc1:0001:1: last_cmd_resp_index = 0 > [ 37.708269] mwifiex_sdio mmc1:0001:1: last_cmd_resp_id: 00 00 00 00 00 00 00 00 00 00 > [ 37.716087] mwifiex_sdio mmc1:0001:1: last_event_index = 0 > [ 37.721564] mwifiex_sdio mmc1:0001:1: last_event: 00 00 00 00 00 00 00 00 00 00 > [ 37.728857] mwifiex_sdio mmc1:0001:1: data_sent=1 cmd_sent=0 > [ 37.734508] mwifiex_sdio mmc1:0001:1: ps_mode=0 ps_state=0 > [ 37.740016] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0) > [ 37.750268] mwifiex_sdio mmc1:0001:1: info: mwifiex_fw_dpc: unregister device This doesn't surprise me at all. What surprises me, though, is that nobody else seems to be able to reproduce this. On veyron, WiFi is connected via SDIO. For good speed, it uses SDIO Interrupts. See this bit in the device tree: cap-sdio-irq; SDIO interrupts (in 4-bit mode) specifically need the card clock to be running all the time to work. I can reproduce your regression (on veryron-jerry, which also has Marvell WiFi) and I can also find that the regression is "gone' if I take out the "cap-sdio-irq" in the veyron device tree. Ah, interestingly enough, turning off SDIO interrupts has the side effect of sending enough (polling) traffic that we never seem to runtime suspend, either. ;-P In general I'd question whether dw_mmc actually gets much power benefit from Runtime PM in Linux. The dw_mmc IP blocks already have a feature in them to automatically stop and restart the card clock. See SDMMC_CLKEN_LOW_PWR. Maybe you're getting the benefit of turning off VMMC or VQMMC? Is that really a lot of power? Presumably those power savings would be for eMMC or normal SD cards (not SDIO). Maybe someone else on this thread knows how Runtime PM is supposed to work in general for SDIO? I notice that in sdio.c the mmc_sdio_runtime_suspend() unconditionally calls mmc_power_off(). That seems odd since the main mmc_sdio_suspend() _doesn't_ call it if mmc_card_keep_power(). Hrmmm... OK, so I just tried this on veyron-minnie. On minnie we have Broadcom WiFi. That actually works (!). Presumably this is because brcmf_sdiod_host_fixup() calls pm_runtime_forbid(). Commenting that out breaks things. OK, and I can make Marvell work by adding "pm_runtime_forbid(func->card->host->parent);" to the end of mwifiex_sdio_probe(). -- So where does that leave us? A) Technically we can fix Marvell's driver to work like Broadcom's. One could possibly assert that this is the wrong fix because technically we could make Runtime PM work with SDIO with enough work. We could theoretically move into 1-bit mode and there (I think) you can get interrupts with the clock off. ...or we could have a dedicated SDIO Interrupt pin (for the embedded case), which is talked about in the SDIO spec. B) Technically we could hack this in the dw_mmc code to disable Runtime PM if we see that an SDIO interrupt is used. One advantage of doing it here is that if we ever add support in dw_mmc for the external SDIO interrupt we could allow Runtime PM in that case. In theory the dw_mmc IP block has some basic support for a dedicated SDIO interrupt pin, but there's no code to support it. C) Technically we could add this into the MMC core. D) Technically we could remove Runtime PM support from dw_mmc for now until someone can address all these issues (and ideally show a real power savings). I'd tend to vote for D, but I've been pretty absent from dw_mmc for a long time, so probably my vote isn't worth that much... Shawn: I think you actually enabled runtime PM. Did you really see power savings, or did it just seem like enabling Runtime PM would be a neat thing to do? -Doug