On 29 October 2018 at 02:43, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote: > > > On 10/26/18 3:25 AM, Ulf Hansson wrote: >> >> On 25 October 2018 at 23:20, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote: >>> >>> Hello mmc people, >>> >>> When booting the veyron speedy, which uses the dw_mmc driver on kernel >>> 4.19 it hangs for ~10 minutes about 1 in 10 boots. >>> This also occurs on kernel version 4.17.2. >> >> >> Do you know if this has been a problem always or is it a regression? >> That would be very nice to know. > > > I do not know for sure. What I can say is that it also occurs on version > 4.9.135 > I know its not very helpful, but I can also tell you this does not occur in > the chromeos 3.14 kernel. I've been digging through the differences, but the > chromeos 3.14 and the mainline 4.19 mmc drivers vary wildly. Yep, it's probably not worth an investigation then. > > > >>> Tracing the hang: >>> if the mmc block system fails to read a sector, mmc_blk_rq_error is >>> called, which calls hw_reset, >>> which calls _mmc_hw_reset in /drivers/mmc/core/mmc.c, >>> which finally calls >>> mmc_flush_cache(host->card) which hangs for ~10 minutes, before failing >>> and resetting the emmc. >>> >>> If the call to mmc_flush_cache(host->card) is commented out, the hang no >>> longer happens. >> >> >> Well, honestly the call to mmc_flush_cache() can be discussed. I >> wonder if it ever have work, without errors. The reason to why I think >> so, is simply because the card is in an unknown state - likely not >> being able to accept a flush request anyway. >> >> On the other hand, hanging for ~10 minutes sounds like a >> controller/driver problem, this should not happen, no matter what. >> >>> >>> >>> The errors printed after it finally recovers are: >>> [ 602.188052] mmc2: cache flush error -110 >>> [ 602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway >>> [ 603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg >>> 0x0 status 0x80202000) >>> >>> The first is printed by mmc_flush_cache, and the second two are from the >>> second half of __mmc_hw_reset, >>> when it re inits the emmc. >>> >>> Could this be due to incorrect clocks? >> >> >> Perhaps. Or that the driver/controller is in some error state, after >> the failed I/O request, which means that it fails to serve any request >> properly. >> >> There is a couple of things I would have tried. >> >> 1. Try using the mmc_test driver and verify that the hw_reset test >> works. This means you will be running the test, when the >> controller/card are in good conditions. > > > The hw_reset test works every time. I ran it in a loop over 200 times and it > never failed. I also verified the hardware reset test calls __mmc_hw_reset() > (not some other weird reset function) and thus mmc_flush_cache() Great! > >> >> 2. If 1) works, repeat the failure sequence you described above (don't >> use mmc_test no more), but replace mmc_flush_cache() in >> _mmc_hw_reset() with some other commands (try both R1 and R1B >> responses) and see what happens. None of the commands should hang. >> >> This should tell us more. > > > When I replaced the call to mmc_flush_cache() with a call to > mmc_wait_for_cmd() and tried cmd.flags = MMC_RSP_R1B and with cmd.flags = > MMC_RSP_R1. It booted properly every time. > > Let me know if that isn't the proper way to sent R1 and R1B responses. > > Earlier, I also tried replacing the call to mmc_flush_cache() with the > command to turn off the cache and turn it on again, as the kernel used to > flush the cache by just turning it off then on again. This resulted in the > same error. This more and more sounds like the card have entered an unknown state. Let me elaborate on that analyze. dw_mmc does not support MMC_CAP_WAIT_WHILE_BUSY, but implements ->card_busy(). These things are used from __mmc_switch() (called when turning off the cache to flush it), to understand when the card has completed its internal operations of flushing the cache. The card signals busy by asserting DAT0 - and for some reason it seems like it never stops signalling busy. Unfortunate, the eMMC spec doesn't define a timeout for the flush cache command, which means in mmc_poll_for_busy() we pick a default value of MMC_OPS_TIMEOUT_MS (10 minutes!). I am guessing this what is happening when mmc_flush_cache() is called from __mmc_hw_reset(). Now, could you elaborate a bit more what is happening in the case when you send the two commands (turn off change and turn on cache) manually by using mmc_wait_for_cmd()? Which one of them hangs and for how long? I am guessing it may be the second command, but that the timeout you encounter is different than 10 minutes, as it's instead caused by dw_mmc internals, as it has a polling-mechanism to avoid sending a new command if the card signals busy. Let's see. > > > Possibly related, I filed another bug with linux-arm that you can view here: > http://lists.infradead.org/pipermail/linux-arm-kernel/2018-October/609008.html > > On boot, it seems the clocks for the mmc have the wrong parents, so it > throws 8 "invalid clk rate" errors, one fore each of the mmc, sdio, etc > clocks. That may be the root cause. But in either case, it starts to look like we should have a reasonable timeout set for the flush command in the reset case. Hanging for ten minutes when trying to recover from an error, doesn't sound very practical to me. > > Thanks for you expertise, > Hal > >> >> Kind regards >> Uffe >> > Kind regards Uffe