Re: [BUG] mmc: dw_mmc*: mmc2: cache flush error -110 hang

Ulf Hansson <ulf.hansson@xxxxxxxxxx> · Mon, 29 Oct 2018 11:59:06 +0100

On 29 October 2018 at 02:43, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:
>
>
> On 10/26/18 3:25 AM, Ulf Hansson wrote:
>>
>> On 25 October 2018 at 23:20, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:
>>>
>>> Hello mmc people,
>>>
>>> When booting the veyron speedy, which uses the dw_mmc driver on kernel
>>> 4.19 it hangs for ~10 minutes about 1 in 10 boots.
>>> This also occurs on kernel version 4.17.2.
>>
>>
>> Do you know if this has been a problem always or is it a regression?
>> That would be very nice to know.
>
>
> I do not know for sure. What I can say is that it also occurs on version
> 4.9.135
> I know its not very helpful, but I can also tell you this does not occur in
> the chromeos 3.14 kernel. I've been digging through the differences, but the
> chromeos 3.14 and the mainline 4.19 mmc drivers vary wildly.

Yep, it's probably not worth an investigation then.

>
>
>
>>> Tracing the hang:
>>> if the mmc block system fails to read a sector, mmc_blk_rq_error is
>>> called, which calls hw_reset,
>>> which calls _mmc_hw_reset in /drivers/mmc/core/mmc.c,
>>> which finally calls
>>> mmc_flush_cache(host->card) which hangs for ~10 minutes, before failing
>>> and resetting the emmc.
>>>
>>> If the call to mmc_flush_cache(host->card) is commented out, the hang no
>>> longer happens.
>>
>>
>> Well, honestly the call to mmc_flush_cache() can be discussed. I
>> wonder if it ever have work, without errors. The reason to why I think
>> so, is simply because the card is in an unknown state - likely not
>> being able to accept a flush request anyway.
>>
>> On the other hand, hanging for ~10 minutes sounds like a
>> controller/driver problem, this should not happen, no matter what.
>>
>>>
>>>
>>> The errors printed after it finally recovers are:
>>> [  602.188052] mmc2: cache flush error -110
>>> [  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
>>> [  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
>>> 0x0 status 0x80202000)
>>>
>>> The first is printed by mmc_flush_cache, and the second two are from the
>>> second half of __mmc_hw_reset,
>>> when it re inits the emmc.
>>>
>>> Could this be due to incorrect clocks?
>>
>>
>> Perhaps. Or that the driver/controller is in some error state, after
>> the failed I/O request, which means that it fails to serve any request
>> properly.
>>
>> There is a couple of things I would have tried.
>>
>> 1. Try using the mmc_test driver and verify that the hw_reset test
>> works. This means you will be running the test, when the
>> controller/card are in good conditions.
>
>
> The hw_reset test works every time. I ran it in a loop over 200 times and it
> never failed. I also verified the hardware reset test calls __mmc_hw_reset()
> (not some other weird reset function) and thus mmc_flush_cache()

Great!

>
>>
>> 2. If 1) works, repeat the failure sequence you described above (don't
>> use mmc_test no more), but replace mmc_flush_cache() in
>> _mmc_hw_reset() with some other commands (try both R1 and R1B
>> responses) and see what happens. None of the commands should hang.
>>
>> This should tell us more.
>
>
> When I replaced the call to mmc_flush_cache() with a call to
> mmc_wait_for_cmd() and tried cmd.flags = MMC_RSP_R1B and with cmd.flags =
> MMC_RSP_R1. It booted properly every time.
>
> Let me know if that isn't the proper way to sent R1 and R1B responses.
>
> Earlier, I also tried replacing the call to mmc_flush_cache() with the
> command to turn off the cache and turn it on again, as the kernel used to
> flush the cache by just turning it off then on again. This resulted in the
> same error.

This more and more sounds like the card have entered an unknown state.
Let me elaborate on that analyze.

dw_mmc does not support MMC_CAP_WAIT_WHILE_BUSY, but implements
->card_busy(). These things are used from __mmc_switch() (called when
turning off the cache to flush it), to understand when the card has
completed its internal operations of flushing the cache. The card
signals busy by asserting DAT0 - and for some reason it seems like it
never stops signalling busy.

Unfortunate, the eMMC spec doesn't define a timeout for the flush
cache command, which means in mmc_poll_for_busy() we pick a default
value of MMC_OPS_TIMEOUT_MS (10 minutes!).

I am guessing this what is happening when mmc_flush_cache() is called
from __mmc_hw_reset().

Now, could you elaborate a bit more what is happening in the case when
you send the two commands (turn off change and turn on cache) manually
by using mmc_wait_for_cmd()? Which one of them hangs and for how long?

I am guessing it may be the second command, but that the timeout you
encounter is different than 10 minutes, as it's instead caused by
dw_mmc internals, as it has a polling-mechanism to avoid sending a new
command if the card signals busy. Let's see.

>
>
> Possibly related, I filed another bug with linux-arm that you can view here:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2018-October/609008.html
>
> On boot, it seems the clocks for the mmc have the wrong parents, so it
> throws 8 "invalid clk rate" errors, one fore each of the mmc, sdio, etc
> clocks.

That may be the root cause.

But in either case, it starts to look like we should have a reasonable
timeout set for the flush command in the reset case. Hanging for ten
minutes when trying to recover from an error, doesn't sound very
practical to me.

>
> Thanks for you expertise,
> Hal
>
>>
>> Kind regards
>> Uffe
>>
>

Kind regards
Uffe