Re: [BUG] mmc: dw_mmc*: mmc2: cache flush error -110 hang

Hal Emmerich <hal@xxxxxxxxxxxxxxx> · Sun, 28 Oct 2018 20:43:57 -0500

On 10/26/18 3:25 AM, Ulf Hansson wrote:
On 25 October 2018 at 23:20, Hal Emmerich <hal@xxxxxxxxxxxxxxx> wrote:
Hello mmc people,

When booting the veyron speedy, which uses the dw_mmc driver, on kernel 4.19, it hangs for ~10 minutes in about 1 in 10 boots.
This also occurs on kernel version 4.17.2.

Do you know if this has been a problem always or is it a regression?
That would be very nice to know.

I do not know for sure. What I can say is that it also occurs on version 4.9.135.
I know it's not very helpful, but I can also tell you this does not occur in the
chromeos 3.14 kernel. I've been digging through the differences, but the
chromeos 3.14 and the mainline 4.19 mmc drivers vary wildly.

Tracing the hang:
If the mmc block system fails to read a sector, mmc_blk_rq_error is called, which calls hw_reset,
which calls _mmc_hw_reset in drivers/mmc/core/mmc.c,
which finally calls
mmc_flush_cache(host->card), which hangs for ~10 minutes before failing and resetting the emmc.

If the call to mmc_flush_cache(host->card) is commented out, the hang no longer happens.

Well, honestly, the call to mmc_flush_cache() is debatable. I
wonder if it has ever worked without errors. The reason I think
so is simply that the card is in an unknown state, and likely not
able to accept a flush request anyway.

On the other hand, hanging for ~10 minutes sounds like a
controller/driver problem; this should not happen, no matter what.

The errors printed after it finally recovers are:
[  602.188052] mmc2: cache flush error -110
[  602.690672] dwmmc_rockchip ff0f0000.dwmmc: Busy; trying anyway
[  603.193323] mmc_host mmc2: Timeout sending command (cmd 0x202000 arg 0x0 status 0x80202000)

The first is printed by mmc_flush_cache, and the other two come from the second half of _mmc_hw_reset,
when it re-initializes the emmc.

Could this be due to incorrect clocks?

Perhaps. Or that the driver/controller is in some error state, after
the failed I/O request, which means that it fails to serve any request
properly.

There are a couple of things I would try.

1. Try using the mmc_test driver and verify that the hw_reset test
works. This means you will be running the test when the
controller/card are in good condition.

The hw_reset test works every time. I ran it in a loop over 200 times and it
never failed. I also verified that the hardware reset test calls _mmc_hw_reset()
(not some other weird reset function) and thus mmc_flush_cache().

2. If 1) works, repeat the failure sequence you described above (don't
use mmc_test anymore), but replace mmc_flush_cache() in
_mmc_hw_reset() with some other commands (try both R1 and R1B
responses) and see what happens. None of the commands should hang.

This should tell us more.

I replaced the call to mmc_flush_cache() with a call to mmc_wait_for_cmd()
and tried both cmd.flags = MMC_RSP_R1B and cmd.flags = MMC_RSP_R1. It booted
properly every time.

Let me know if that isn't the proper way to send commands with R1 and R1B responses.

Earlier, I also tried replacing the call to mmc_flush_cache() with the command 
to turn off the cache and turn it on again, as the kernel used to flush the 
cache by just turning it off then on again. This resulted in the same error.

Possibly related, I filed another bug with linux-arm that you can view here: 
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-October/609008.html

On boot, it seems the clocks for the mmc have the wrong parents, so it throws 8
"invalid clk rate" errors, one for each of the mmc, sdio, etc. clocks.

Thanks for your expertise,
Hal

Kind regards
Uffe