On 30 October 2018 at 06:48, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote: > > > On 10/29/18 5:59 AM, Ulf Hansson wrote: >> >> On 29 October 2018 at 02:43, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote: >>> >>> >>> >>> On 10/26/18 3:25 AM, Ulf Hansson wrote: >>>> >>>> >>>> On 25 October 2018 at 23:20, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote: >>>>> >>>>> >>>>> Hello mmc people, >>>>> >>>>> When booting the veyron speedy, which uses the dw_mmc driver on kernel >>>>> 4.19 it hangs for ~10 minutes about 1 in 10 boots. >>>>> This also occurs on kernel version 4.17.2. >>>> >>>> >>>> >>>> Do you know if this has been a problem always or is it a regression? >>>> That would be very nice to know. >>> >>> >>> >>> I do not know for sure. What I can say is that it also occurs on version >>> 4.9.135 >>> I know its not very helpful, but I can also tell you this does not occur >>> in >>> the chromeos 3.14 kernel. I've been digging through the differences, but >>> the >>> chromeos 3.14 and the mainline 4.19 mmc drivers vary wildly. >> >> >> Yep, it's probably not worth an investigation then. >> >>> >>> >>> >>>>> Tracing the hang: >>>>> if the mmc block system fails to read a sector, mmc_blk_rq_error is >>>>> called, which calls hw_reset, >>>>> which calls _mmc_hw_reset in /drivers/mmc/core/mmc.c, >>>>> which finally calls >>>>> mmc_flush_cache(host->card) which hangs for ~10 minutes, before failing >>>>> and resetting the emmc. >>>>> >>>>> If the call to mmc_flush_cache(host->card) is commented out, the hang >>>>> no >>>>> longer happens. >>>> >>>> >>>> >>>> Well, honestly the call to mmc_flush_cache() can be discussed. I >>>> wonder if it ever have work, without errors. The reason to why I think >>>> so, is simply because the card is in an unknown state - likely not >>>> being able to accept a flush request anyway. >>>> >>>> On the other hand, hanging for ~10 minutes sounds like a >>>> controller/driver problem, this should not happen, no matter what. >>>> >>>>> >>>>> >>>>> The errors printed after it finally recovers are: >>>>> [ 602.188052] mmc2: cache flush error -110 >>>>> [ 602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway >>>>> [ 603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg >>>>> 0x0 status 0x80202000) >>>>> >>>>> The first is printed by mmc_flush_cache, and the second two are from >>>>> the >>>>> second half of __mmc_hw_reset, >>>>> when it re inits the emmc. >>>>> >>>>> Could this be due to incorrect clocks? >>>> >>>> >>>> >>>> Perhaps. Or that the driver/controller is in some error state, after >>>> the failed I/O request, which means that it fails to serve any request >>>> properly. >>>> >>>> There is a couple of things I would have tried. >>>> >>>> 1. Try using the mmc_test driver and verify that the hw_reset test >>>> works. This means you will be running the test, when the >>>> controller/card are in good conditions. >>> >>> >>> >>> The hw_reset test works every time. I ran it in a loop over 200 times and >>> it >>> never failed. I also verified the hardware reset test calls >>> __mmc_hw_reset() >>> (not some other weird reset function) and thus mmc_flush_cache() >> >> >> Great! >> >>> >>>> >>>> 2. If 1) works, repeat the failure sequence you described above (don't >>>> use mmc_test no more), but replace mmc_flush_cache() in >>>> _mmc_hw_reset() with some other commands (try both R1 and R1B >>>> responses) and see what happens. None of the commands should hang. >>>> >>>> This should tell us more. >>> >>> >>> >>> When I replaced the call to mmc_flush_cache() with a call to >>> mmc_wait_for_cmd() and tried cmd.flags = MMC_RSP_R1B and with cmd.flags = >>> MMC_RSP_R1. It booted properly every time. >>> >>> Let me know if that isn't the proper way to sent R1 and R1B responses. >>> >>> Earlier, I also tried replacing the call to mmc_flush_cache() with the >>> command to turn off the cache and turn it on again, as the kernel used to >>> flush the cache by just turning it off then on again. This resulted in >>> the >>> same error. >> >> >> This more and more sounds like the card have entered an unknown state. >> Let me elaborate on that analyze. >> >> dw_mmc does not support MMC_CAP_WAIT_WHILE_BUSY, but implements >> ->card_busy(). These things are used from __mmc_switch() (called when >> turning off the cache to flush it), to understand when the card has >> completed its internal operations of flushing the cache. The card >> signals busy by asserting DAT0 - and for some reason it seems like it >> never stops signalling busy. >> >> Unfortunate, the eMMC spec doesn't define a timeout for the flush >> cache command, which means in mmc_poll_for_busy() we pick a default >> value of MMC_OPS_TIMEOUT_MS (10 minutes!). >> >> I am guessing this what is happening when mmc_flush_cache() is called >> from __mmc_hw_reset(). > > > That seems correct correct. I somehow overlooked the fact that this > > [ 602.187067] mmc2: Card stuck being busy! mmc_poll_for_busy > > is printed immediately before > > [ 602.188052] mmc2: cache flush error -110 > [ 602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway > [ 603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg > 0x0 status 0x80202000) > > which basically confirms that. Sorry about missing that. > > Additionally, I found that these lines > > [ 602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway > [ 603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg > 0x0 status 0x80202000) > > are from _mmc_hw_reset() calling mmc_init_card() which calls some functions > that eventually call mci_send_cmd(). > > _mmc_hw_reset() calls mmc_init_card() after the power cycle is complete. > > Since that command is sent in the process of re-init-ing the card _after_ > the power cycle/ power reset, I believe reset is not functioning properly > with this card. Okay. Can you check which of reset path that is taken from _mmc_hw_reset()? Either you should end up calling dw_mci_hw_reset() or you must have a mmc pwrseq to be called via mmc_pwrseq_reset(). The first option means that you need the "cap-mmc-hw-reset" DT property to be set for you mmc host node. The second means the you should have a "mmc-pwrseq" handle in you mmc host node. If neither exists, I doubt reset can work. > > Maybe some commands are sent to the emmc by the bootloader on a normal boot > that the kernel mmc init process relies on for this card? > >> >> Now, could you elaborate a bit more what is happening in the case when >> you send the two commands (turn off change and turn on cache) manually >> by using mmc_wait_for_cmd()? Which one of them hangs and for how long? >> >> I am guessing it may be the second command, but that the timeout you >> encounter is different than 10 minutes, as it's instead caused by >> dw_mmc internals, as it has a polling-mechanism to avoid sending a new >> command if the card signals busy. Let's see. > > > > I used > err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL, > EXT_CSD_CACHE_CTRL, 0, timeout); > > to turn off the cache, and then swapped the 0 for a 1 to turn it back on. > This is how the mmc_cache_control function did it in 3.14. The comment on > the function says that turning the cache off shall trigger flushing the > cache. > > The initial command to turn off the cache causes the 10 minute hang. > The command to then turn it back on calls mmc_wait_for_cmd() which > immediately returns an error and without the 10 minute hang. So that's > probably due to the polling-mechanism you mentioned. Yep, make sense. > >> >>> >>> >>> Possibly related, I filed another bug with linux-arm that you can view >>> here: >>> >>> http://lists.infradead.org/pipermail/linux-arm-kernel/2018-October/609008.html >>> >>> On boot, it seems the clocks for the mmc have the wrong parents, so it >>> throws 8 "invalid clk rate" errors, one fore each of the mmc, sdio, etc >>> clocks. >> >> >> That may be the root cause. >> >> But in either case, it starts to look like we should have a reasonable >> timeout set for the flush command in the reset case. Hanging for ten >> minutes when trying to recover from an error, doesn't sound very >> practical to me. >> > > I would agree. Definitely not practical. > Do you have any suggestions what the parents of the mmc/sdio clocks should > be or a place to start for me to figure it out the correct values? I tried > looking for documentations or examples but I'm not finding much. Sorry, when it comes to the platform configurations, I can't help much. You need to ping some of rockchip people. Kind regards Uffe