RE: [PATCH] mmc: block: add reset workaround for partition switch failures

Avri Altman <Avri.Altman@xxxxxxxxxxx> · Mon, 3 Mar 2025 08:51:00 +0000

> Hello,
> >> Some eMMC devices (e.g., BGSD4R and AIM20F) may enter an unresponsive
> >> state after encountering CRC errors during RPMB writes (CMD25). This
> >> prevents the device from switching back to the main partition via
> >> CMD6, blocking further I/O operations.
> >Different cards on the same platform?
> >Can you share which platform, and a few lines from the log supporting your
> >analysis?
> 
> I tested on R-Car Gen3/4 platforms, which use the same host controller IP and
> the tmio_mmc host driver.
> The tests were conducted on different board and eMMC combinations:
> - Gen3 Board with Samsung eMMC (BGSD4R) → Issue observed
> - Gen3 Board with Micron eMMC (AIM20F, new version) → Issue observed
> - Gen3 Board with Micron eMMC (AIM20F, old version) → No issue
> - Gen4 Board with Micron eMMC (G1M15L) → No issue
> 
> The issue only occurs in the RPMB partition during write operations, where a
> CRC error is triggered.
> To investigate further, I hacked the host driver to generate a dummy CRC
> during the CMD25 data phase.
> The reproduced log is as follows:
> $ ./mmc rpmb read-counter /dev/mmcblk0rpmb
> [   75.557848] w_t: -->START_CMD6 (arg: 3b30301)
> [   75.557863] w_t:    resp[0]=900
> [   75.557875] w_t: -->START_CMD13 (arg: 10000)
> [   75.557884] w_t:    resp[0]=900
> [   75.557894] w_t: -->START_CMD23 (arg: 1)
> [   75.557903] w_t:    resp[0]=900
> [   75.557915] w_t: -->START_CMD25 (arg: 0)
> [   75.557924] w_t:    resp[0]=900
> [   75.557931] !!!!!!!!!!!!!!!!, make a dummy write CRC on DAT
> [   75.563631] w_t: (data_err) -84 stat=20820604 error=5800 (which means the
> eMMC device returned a negative CRC status)
> [   75.563672] renesas_sdhi_internal_dmac ee140000.sd: __mmc_blk_ioctl_cmd: data error -84
> [   75.573112] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573132] w_t: (cmd_err -110) stat=20c00401 error=12000
> [   75.573154] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573169] w_t: (cmd_err -110) stat=20c00401 error=12000
> [   75.573183] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573197] w_t: (cmd_err -110) stat=20c00401 error=12000
> [   75.573211] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573225] w_t: (cmd_err -110) stat=20c00401 error=12000
> After this issue occurs, the eMMC device no longer responds to CMD6, and even
> subsequent accesses to the main partition behave abnormally.
> However, if we perform an eMMC card reset at this point, the retry of CMD6
> works as expected.
Thank you for sharing it.

> 
> BTW,
> I now believe that sending CMD12 is a better solution in this case rather than
> performing a reset.
> According to information from the eMMC vendor, even in a closed-end write
> operation (CMD23 + CMD25), CMD12 is required if any communication error
> occurs.
> The JESD84 specification also mentions a similar requirement: "A stop
> command is not required at the end of this type of multiple block write unless
> terminated with an error."
> I briefly tested this approach on the affected board, and it worked
> successfully.
OK.
Please note that some host controllers do that as auto-cmd.

> 
> >>
> >> The root cause is suspected to be a firmware/hardware issue in
> >> specific eMMC models. A workaround is to perform a hardware reset via
> >> mmc_hw_reset()
> >> when the partition switch fails, followed by a retry.
> >Same fw bug in 2 different products?
> >
> >Why do we need to fix it here?
> >The ioctl will eventually return an error, and reset is needed anyway.
> >If the eMMC is the primary storage, the platform is rebooting without being
> >aware of what went wrong.
> 
> In the main partition, a similar reset operation is already implemented in
> mmc_blk_issue_rw_rq(), so I believe applying the same approach for RPMB
> should be acceptable.
> 		case MMC_BLK_ABORT:
> 			if (!mmc_blk_reset(md, card->host, type))
> 				break;
> 			mmc_blk_rw_cmd_abort(mq, card, old_req, mq_rq);
> 			mmc_blk_rw_try_restart(mq, new_req, mqrq_cur);
> 			return;
The code that you are citing no longer exists.
It was removed a while ago - see https://lore.kernel.org/linux-block/1511962879-24262-23-git-send-email-adrian.hunter@xxxxxxxxx/

My point is that you are recovering silently from an ioctl error that the sender should be made aware of and recover from itself.

Thanks,
Avri

> 
> 
> Best Regards,
> Guan Wang