Re: [BUG] mmc: dw_mmc*: mmc2: cache flush error -110 hang

Hal Emmerich <hal@xxxxxxxxxxxxxxx> · Tue, 30 Oct 2018 00:48:09 -0500

On 10/29/18 5:59 AM, Ulf Hansson wrote:
On 29 October 2018 at 02:43, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:

On 10/26/18 3:25 AM, Ulf Hansson wrote:

On 25 October 2018 at 23:20, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:

Hello mmc people,

When booting the veyron speedy, which uses the dw_mmc driver on kernel
4.19 it hangs for ~10 minutes about 1 in 10 boots.
This also occurs on kernel version 4.17.2.

Do you know if this has been a problem always or is it a regression?
That would be very nice to know.

I do not know for sure. What I can say is that it also occurs on version
4.9.135
I know its not very helpful, but I can also tell you this does not occur in
the chromeos 3.14 kernel. I've been digging through the differences, but the
chromeos 3.14 and the mainline 4.19 mmc drivers vary wildly.

Yep, it's probably not worth an investigation then.

Tracing the hang:
if the mmc block system fails to read a sector, mmc_blk_rq_error is
called, which calls hw_reset,
which calls _mmc_hw_reset in /drivers/mmc/core/mmc.c,
which finally calls
mmc_flush_cache(host->card) which hangs for ~10 minutes, before failing
and resetting the emmc.

If the call to mmc_flush_cache(host->card) is commented out, the hang no
longer happens.

Well, honestly the call to mmc_flush_cache() can be discussed. I
wonder if it ever have work, without errors. The reason to why I think
so, is simply because the card is in an unknown state - likely not
being able to accept a flush request anyway.

On the other hand, hanging for ~10 minutes sounds like a
controller/driver problem, this should not happen, no matter what.

The errors printed after it finally recovers are:
[  602.188052] mmc2: cache flush error -110
[  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
[  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
0x0 status 0x80202000)

The first is printed by mmc_flush_cache, and the second two are from the
second half of __mmc_hw_reset,
when it re inits the emmc.

Could this be due to incorrect clocks?

Perhaps. Or that the driver/controller is in some error state, after
the failed I/O request, which means that it fails to serve any request
properly.

There is a couple of things I would have tried.

1. Try using the mmc_test driver and verify that the hw_reset test
works. This means you will be running the test, when the
controller/card are in good conditions.

The hw_reset test works every time. I ran it in a loop over 200 times and it
never failed. I also verified the hardware reset test calls __mmc_hw_reset()
(not some other weird reset function) and thus mmc_flush_cache()

Great!

2. If 1) works, repeat the failure sequence you described above (don't
use mmc_test no more), but replace mmc_flush_cache() in
_mmc_hw_reset() with some other commands (try both R1 and R1B
responses) and see what happens. None of the commands should hang.

This should tell us more.

When I replaced the call to mmc_flush_cache() with a call to
mmc_wait_for_cmd() and tried cmd.flags = MMC_RSP_R1B and with cmd.flags =
MMC_RSP_R1. It booted properly every time.

Let me know if that isn't the proper way to sent R1 and R1B responses.

Earlier, I also tried replacing the call to mmc_flush_cache() with the
command to turn off the cache and turn it on again, as the kernel used to
flush the cache by just turning it off then on again. This resulted in the
same error.

This more and more sounds like the card have entered an unknown state.
Let me elaborate on that analyze.

dw_mmc does not support MMC_CAP_WAIT_WHILE_BUSY, but implements
->card_busy(). These things are used from __mmc_switch() (called when
turning off the cache to flush it), to understand when the card has
completed its internal operations of flushing the cache. The card
signals busy by asserting DAT0 - and for some reason it seems like it
never stops signalling busy.

Unfortunate, the eMMC spec doesn't define a timeout for the flush
cache command, which means in mmc_poll_for_busy() we pick a default
value of MMC_OPS_TIMEOUT_MS (10 minutes!).

I am guessing this what is happening when mmc_flush_cache() is called
from __mmc_hw_reset().

That seems correct correct. I somehow overlooked the fact that this

[  602.187067] mmc2: Card stuck being busy! mmc_poll_for_busy

is printed immediately before

[  602.188052] mmc2: cache flush error -110
[  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
[  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
0x0 status 0x80202000)

 which basically confirms that. Sorry about missing that.

Additionally, I found that these lines

[  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
[  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg
0x0 status 0x80202000)

are from _mmc_hw_reset() calling mmc_init_card() which calls some functions that 
eventually call mci_send_cmd().

_mmc_hw_reset() calls mmc_init_card() after the power cycle is complete.

Since that command is sent in the process of re-init-ing the card _after_ the 
power cycle/ power reset, I believe reset is not functioning properly with this 
card.

Maybe some commands are sent to the emmc by the bootloader on a normal boot that 
the kernel mmc init process relies on for this card?

Now, could you elaborate a bit more what is happening in the case when
you send the two commands (turn off change and turn on cache) manually
by using mmc_wait_for_cmd()? Which one of them hangs and for how long?

I am guessing it may be the second command, but that the timeout you
encounter is different than 10 minutes, as it's instead caused by
dw_mmc internals, as it has a polling-mechanism to avoid sending a new
command if the card signals busy. Let's see.

I used
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
					EXT_CSD_CACHE_CTRL, 0, timeout);

to turn off the cache, and then swapped the 0 for a 1 to turn it back on. This 
is how the mmc_cache_control function did it in 3.14. The comment on the 
function says that turning the cache off shall trigger flushing the cache.

The initial command to turn off the cache causes the 10 minute hang.
The command to then turn it back on calls mmc_wait_for_cmd() which immediately 
returns an error and without the 10 minute hang. So that's probably due to the 
polling-mechanism you mentioned.

Possibly related, I filed another bug with linux-arm that you can view here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-October/609008.html

On boot, it seems the clocks for the mmc have the wrong parents, so it
throws 8 "invalid clk rate" errors, one fore each of the mmc, sdio, etc
clocks.

That may be the root cause.

But in either case, it starts to look like we should have a reasonable
timeout set for the flush command in the reset case. Hanging for ten
minutes when trying to recover from an error, doesn't sound very
practical to me.

I would agree. Definitely not practical.
Do you have any suggestions what the parents of the mmc/sdio clocks should be or 
a place to start for me to figure it out the correct values? I tried looking for 
documentations or examples but I'm not finding much.

Thanks for you expertise,
Hal

Kind regards
Uffe

Kind regards
Uffe

Thanks for all of your help,
Hal