Re: [BUG] mmc: dw_mmc*: mmc2: cache flush error -110 hang

Ulf Hansson <ulf.hansson@xxxxxxxxxx> · Tue, 30 Oct 2018 13:47:18 +0100

On 30 October 2018 at 06:48, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:
>
>
> On 10/29/18 5:59 AM, Ulf Hansson wrote:
>>
>> On 29 October 2018 at 02:43, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:
>>>
>>>
>>>
>>> On 10/26/18 3:25 AM, Ulf Hansson wrote:
>>>>
>>>>
>>>> On 25 October 2018 at 23:20, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> Hello mmc people,
>>>>>
>>>>> When booting the veyron speedy, which uses the dw_mmc driver on kernel
>>>>> 4.19 it hangs for ~10 minutes about 1 in 10 boots.
>>>>> This also occurs on kernel version 4.17.2.
>>>>
>>>>
>>>>
>>>> Do you know if this has been a problem always or is it a regression?
>>>> That would be very nice to know.
>>>
>>>
>>>
>>> I do not know for sure. What I can say is that it also occurs on version
>>> 4.9.135
>>> I know its not very helpful, but I can also tell you this does not occur
>>> in
>>> the chromeos 3.14 kernel. I've been digging through the differences, but
>>> the
>>> chromeos 3.14 and the mainline 4.19 mmc drivers vary wildly.
>>
>>
>> Yep, it's probably not worth an investigation then.
>>
>>>
>>>
>>>
>>>>> Tracing the hang:
>>>>> if the mmc block system fails to read a sector, mmc_blk_rq_error is
>>>>> called, which calls hw_reset,
>>>>> which calls _mmc_hw_reset in /drivers/mmc/core/mmc.c,
>>>>> which finally calls
>>>>> mmc_flush_cache(host->card) which hangs for ~10 minutes, before failing
>>>>> and resetting the emmc.
>>>>>
>>>>> If the call to mmc_flush_cache(host->card) is commented out, the hang
>>>>> no
>>>>> longer happens.
>>>>
>>>>
>>>>
>>>> Well, honestly the call to mmc_flush_cache() can be discussed. I
>>>> wonder if it ever have work, without errors. The reason to why I think
>>>> so, is simply because the card is in an unknown state - likely not
>>>> being able to accept a flush request anyway.
>>>>
>>>> On the other hand, hanging for ~10 minutes sounds like a
>>>> controller/driver problem, this should not happen, no matter what.
>>>>
>>>>>
>>>>>
>>>>> The errors printed after it finally recovers are:
>>>>> [  602.188052] mmc2: cache flush error -110
>>>>> [  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
>>>>> [  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
>>>>> 0x0 status 0x80202000)
>>>>>
>>>>> The first is printed by mmc_flush_cache, and the second two are from
>>>>> the
>>>>> second half of __mmc_hw_reset,
>>>>> when it re inits the emmc.
>>>>>
>>>>> Could this be due to incorrect clocks?
>>>>
>>>>
>>>>
>>>> Perhaps. Or that the driver/controller is in some error state, after
>>>> the failed I/O request, which means that it fails to serve any request
>>>> properly.
>>>>
>>>> There is a couple of things I would have tried.
>>>>
>>>> 1. Try using the mmc_test driver and verify that the hw_reset test
>>>> works. This means you will be running the test, when the
>>>> controller/card are in good conditions.
>>>
>>>
>>>
>>> The hw_reset test works every time. I ran it in a loop over 200 times and
>>> it
>>> never failed. I also verified the hardware reset test calls
>>> __mmc_hw_reset()
>>> (not some other weird reset function) and thus mmc_flush_cache()
>>
>>
>> Great!
>>
>>>
>>>>
>>>> 2. If 1) works, repeat the failure sequence you described above (don't
>>>> use mmc_test no more), but replace mmc_flush_cache() in
>>>> _mmc_hw_reset() with some other commands (try both R1 and R1B
>>>> responses) and see what happens. None of the commands should hang.
>>>>
>>>> This should tell us more.
>>>
>>>
>>>
>>> When I replaced the call to mmc_flush_cache() with a call to
>>> mmc_wait_for_cmd() and tried cmd.flags = MMC_RSP_R1B and with cmd.flags =
>>> MMC_RSP_R1. It booted properly every time.
>>>
>>> Let me know if that isn't the proper way to sent R1 and R1B responses.
>>>
>>> Earlier, I also tried replacing the call to mmc_flush_cache() with the
>>> command to turn off the cache and turn it on again, as the kernel used to
>>> flush the cache by just turning it off then on again. This resulted in
>>> the
>>> same error.
>>
>>
>> This more and more sounds like the card have entered an unknown state.
>> Let me elaborate on that analyze.
>>
>> dw_mmc does not support MMC_CAP_WAIT_WHILE_BUSY, but implements
>> ->card_busy(). These things are used from __mmc_switch() (called when
>> turning off the cache to flush it), to understand when the card has
>> completed its internal operations of flushing the cache. The card
>> signals busy by asserting DAT0 - and for some reason it seems like it
>> never stops signalling busy.
>>
>> Unfortunate, the eMMC spec doesn't define a timeout for the flush
>> cache command, which means in mmc_poll_for_busy() we pick a default
>> value of MMC_OPS_TIMEOUT_MS (10 minutes!).
>>
>> I am guessing this what is happening when mmc_flush_cache() is called
>> from __mmc_hw_reset().
>
>
> That seems correct correct. I somehow overlooked the fact that this
>
> [  602.187067] mmc2: Card stuck being busy! mmc_poll_for_busy
>
> is printed immediately before
>
> [  602.188052] mmc2: cache flush error -110
> [  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
> [  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
> 0x0 status 0x80202000)
>
>  which basically confirms that. Sorry about missing that.
>
> Additionally, I found that these lines
>
> [  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
> [  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
> 0x0 status 0x80202000)
>
> are from _mmc_hw_reset() calling mmc_init_card() which calls some functions
> that eventually call mci_send_cmd().
>
> _mmc_hw_reset() calls mmc_init_card() after the power cycle is complete.
>
> Since that command is sent in the process of re-init-ing the card _after_
> the power cycle/ power reset, I believe reset is not functioning properly
> with this card.

Okay.

Can you check which of reset path that is taken from _mmc_hw_reset()?

Either you should end up calling dw_mci_hw_reset() or you must have a
mmc pwrseq to be called via mmc_pwrseq_reset().

The first option means that you need the "cap-mmc-hw-reset" DT
property to be set for you mmc host node. The second means the you
should have a "mmc-pwrseq" handle in you mmc host node.

If neither exists, I doubt reset can work.

>
> Maybe some commands are sent to the emmc by the bootloader on a normal boot
> that the kernel mmc init process relies on for this card?
>
>>
>> Now, could you elaborate a bit more what is happening in the case when
>> you send the two commands (turn off change and turn on cache) manually
>> by using mmc_wait_for_cmd()? Which one of them hangs and for how long?
>>
>> I am guessing it may be the second command, but that the timeout you
>> encounter is different than 10 minutes, as it's instead caused by
>> dw_mmc internals, as it has a polling-mechanism to avoid sending a new
>> command if the card signals busy. Let's see.
>
>
>
> I used
> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>                                         EXT_CSD_CACHE_CTRL, 0, timeout);
>
> to turn off the cache, and then swapped the 0 for a 1 to turn it back on.
> This is how the mmc_cache_control function did it in 3.14. The comment on
> the function says that turning the cache off shall trigger flushing the
> cache.
>
> The initial command to turn off the cache causes the 10 minute hang.
> The command to then turn it back on calls mmc_wait_for_cmd() which
> immediately returns an error and without the 10 minute hang. So that's
> probably due to the polling-mechanism you mentioned.

Yep, make sense.

>
>>
>>>
>>>
>>> Possibly related, I filed another bug with linux-arm that you can view
>>> here:
>>>
>>> http://lists.infradead.org/pipermail/linux-arm-kernel/2018-October/609008.html
>>>
>>> On boot, it seems the clocks for the mmc have the wrong parents, so it
>>> throws 8 "invalid clk rate" errors, one fore each of the mmc, sdio, etc
>>> clocks.
>>
>>
>> That may be the root cause.
>>
>> But in either case, it starts to look like we should have a reasonable
>> timeout set for the flush command in the reset case. Hanging for ten
>> minutes when trying to recover from an error, doesn't sound very
>> practical to me.
>>
>
> I would agree. Definitely not practical.
> Do you have any suggestions what the parents of the mmc/sdio clocks should
> be or a place to start for me to figure it out the correct values? I tried
> looking for documentations or examples but I'm not finding much.

Sorry, when it comes to the platform configurations, I can't help
much. You need to ping some of rockchip people.

Kind regards
Uffe