RE: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

Avri Altman <Avri.Altman@xxxxxxx> · Mon, 4 Dec 2023 11:31:15 +0000



> 
> On 02/12/23 16:47:23, Avri Altman wrote:
> > Hi,
> > Sorry for joining so late - This thread was routed to my junk mail somehow.
> > We were observing this issue recently with one of our clients using a
> > Broadcom platform.
> > Similar to this case, the tuning process didn't use cmd21, so sending
> > only cmd6 is perfectly ok.
> > We couldn't find any issue with the device at the time.
> > During our investigation, it turned out that the client had a kernel
> > hack of its own, and once it was removed the issue wasn't reproducing
> > anymore.
> 
> um, do you know what driver or setting could have caused the issue?
> 
> This product is using the Xilinx kernel. 5.15.64
> https://github.com/Xilinx/linux-xlnx
It was a Broadcom platform - BCX972127OTT, using sdhci-brcmstb, with a 5.4 kernel.

> 
> >
> > > > > >>> hi Adrian
> > > > > >>>
> > > > > >>> Sure, this is the output of the trace:
> > > > > >>>
> > > > > >>> [  422.018756] mmc0: sdhci: IRQ status 0x00000020 [
> > > > > >>> 422.018789]
> > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [  422.018817] mmc0: sdhci:
> > > > > >>> IRQ status 0x00000020 [  422.018848] mmc0: sdhci: IRQ status
> > > > > >>> 0x00000020 [  422.018875] mmc0: sdhci: IRQ status 0x00000020
> > > > > >>> [ 422.018902] mmc0: sdhci: IRQ status 0x00000020 [
> > > > > >>> 422.018932]
> > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [  422.020013] mmc0: sdhci:
> > > > > >>> IRQ status 0x00000001 [  422.020027] mmc0: sdhci: IRQ status
> > > > > >>> 0x00000002 [  422.020034] mmc0: req done (CMD6): 0: 00000800
> > > > > >>> 00000000 00000000 00000000 [  422.020054] mmc0: starting
> > > > > >>> CMD13 arg 00010000 flags 00000195 [  422.020068] mmc0:
> > > > > >>> sdhci: IRQ status 0x00000001 [  422.020076] mmc0: req done
> (CMD13): 0:
> > > > > >>> 00000900 00000000 00000000 00000000 [  422.020092] <mmc0:
> > > > > >>> starting CMD23 arg 00000001 flags 00000015> [  422.020101]
> mmc0:
> > > > > >>> starting CMD25 arg 00000000 flags 00000035
> > > > > >>> [  422.020108] mmc0:     blksz 512 blocks 1 flags 00000100 tsac 400
> ms
> > > nsac 0
> > > > > >>> [  422.020124] mmc0: sdhci: IRQ status 0x00000001 [
> > > > > >>> 422.021671]
> > > > > >>> mmc0: sdhci: IRQ status 0x00000002 [  422.021691] mmc0: req
> > > > > >>> done
> > > > > >>> <CMD23>: 0: 00000000 00000000 00000000 00000000 [
> > > > > >>> 422.021700]
> > > > > >>> mmc0: req done (CMD25): 0: 00000900 00000000 00000000
> > > 00000000
> > > > > >>> [  422.021708] mmc0:     512 bytes transferred: 0
> > > > > >>> [  422.021728] mmc0: starting CMD13 arg 00010000 flags
> > > > > >>> 00000195 [  422.021743] mmc0: sdhci: IRQ status 0x00000001 [
> > > > > >>> 422.021752]
> > > > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000
> > > 00000000 [
> > > > > >>> 422.021771] <mmc0: starting CMD23 arg 00000001 flags
> > > > > >>> 00000015> [ 422.021779] mmc0: starting CMD18 arg 00000000
> flags 00000035
> > > > > >>> [  422.021785] mmc0:     blksz 512 blocks 1 flags 00000200 tsac 100
> ms
> > > nsac 0
> > > > > >>> [  422.021804] mmc0: sdhci: IRQ status 0x00000001 [
> > > > > >>> 422.022566]
> > > > > >>> mmc0: sdhci: IRQ status 0x00208000
> > > > > >>> <---------------------------------- this doesnt seem right [
> > Why not?
> > It's cmd25-cmd25-cmd18 which implies rpmb write?
> 
> sorry I am referring to the IRQ status  0x00208000 (CRC errors) - not the
> sequence.
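
For anyone reading the trace later, here is a minimal sketch of how the status words above break down. It assumes the SDHCI interrupt status bit layout as I remember it from drivers/mmc/host/sdhci.h; the macro values are quoted from memory, and toy_data_error() is a made-up helper, not a kernel function.

```c
#include <errno.h>
#include <stdint.h>

/* Bit values as recalled from drivers/mmc/host/sdhci.h (assumption). */
#define SDHCI_INT_RESPONSE      0x00000001u /* command complete */
#define SDHCI_INT_DATA_END      0x00000002u /* transfer complete */
#define SDHCI_INT_DATA_AVAIL    0x00000020u /* buffer read ready */
#define SDHCI_INT_ERROR         0x00008000u /* summary error bit */
#define SDHCI_INT_DATA_TIMEOUT  0x00100000u
#define SDHCI_INT_DATA_CRC      0x00200000u
#define SDHCI_INT_DATA_END_BIT  0x00400000u

/* Hypothetical helper: the errno a data-phase error roughly maps to. */
int toy_data_error(uint32_t intmask)
{
	if (intmask & SDHCI_INT_DATA_TIMEOUT)
		return -ETIMEDOUT;
	if (intmask & (SDHCI_INT_DATA_END_BIT | SDHCI_INT_DATA_CRC))
		return -EILSEQ; /* EILSEQ is 84 on Linux */
	return 0;
}
```

With these values, 0x00208000 is SDHCI_INT_ERROR | SDHCI_INT_DATA_CRC, which is consistent with the "data error -84" reported further down.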
> 
> >
> > > > > >>> 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000
> > > 00000000 00000000 [  422.022639] mmc0: req done (CMD18): 0:
> 00000900
> > > 00000000 00000000 00000000
> > > > > >>> [  422.022647] mmc0:     0 bytes transferred: -84 < ---------------------
> ----
> > > -------- it should have transfered 4096 bytes
> > > > > >>> [  422.022669] sdhci-arasan ff160000.mmc:
> __mmc_blk_ioctl_cmd:
> > > > > >>> data error -84 [  422.029619] mmc0: starting CMD6 arg
> > > > > >>> 03b30001 flags 0000049d [  422.029636] mmc0: sdhci: IRQ
> > > > > >>> status 0x00000001 [  422.029652] mmc0: sdhci: IRQ status
> > > > > >>> 0x00000002 [  422.029660]
> > > > > >>> mmc0: req done (CMD6): 0: 00000800 00000000 00000000
> > > > > >>> 00000000
> > > [
> > > > > >>> 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
> > > > > >>> [ 422.029693] mmc0: sdhci: IRQ status 0x00000001 [
> > > > > >>> 422.029702]
> > > > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000
> > > 00000000 [
> > > > > >>> 422.196996] <mmc0: starting CMD23 arg 00000400 flags
> > > > > >>> 00000015> [ 422.197051] mmc0: starting CMD25 arg 058160e0
> flags 000000b5
> > > > > >>> [  422.197079] mmc0:     blksz 512 blocks 1024 flags 00000100 tsac
> 400
> > > ms nsac 0
> > > > > >>> [  422.197110] mmc0:     CMD12 arg 00000000 flags 0000049d
> > > > > >>> [  422.199455] mmc0: sdhci: IRQ status 0x00000020 [
> > > > > >>> 422.199526]
> > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [  422.199585] mmc0: sdhci:
> > > > > >>> IRQ status 0x00000020 [  422.199641] mmc0: sdhci: IRQ status
> > > > > >>> 0x00000020 [  422.199695] mmc0: sdhci: IRQ status 0x00000020
> > > > > >>> [ 422.199753] mmc0: sdhci: IRQ status 0x00000020 [
> > > > > >>> 422.199811]
> > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [  422.199865] mmc0: sdhci:
> > > > > >>> IRQ status 0x00000020 [  422.199919] mmc0: sdhci: IRQ status
> > > > > >>> 0x00000020 [  422.199972] mmc0: sdhci: IRQ status 0x00000020
> > > > > >>> [ 422.200026] mmc0: sdhci: IRQ status 0x00000020
> > > > > >>>
> > > > > >>>
> > > > > >>> does this help?
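
As a reading aid for the "CMD6 arg 03b30001" entry in the trace: a small sketch that splits a SWITCH argument into its fields, assuming the usual eMMC layout (access in bits 25:24, EXT_CSD index in bits 23:16, value in bits 15:8, command set in bits 2:0). The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the decoded CMD6 (SWITCH) argument. */
struct toy_switch_arg {
	unsigned int access;  /* bits 25:24, 3 = write byte */
	unsigned int index;   /* bits 23:16, EXT_CSD byte index */
	unsigned int value;   /* bits 15:8, value written to that byte */
	unsigned int cmd_set; /* bits 2:0 */
};

struct toy_switch_arg toy_decode_switch(uint32_t arg)
{
	struct toy_switch_arg s;

	s.access  = (arg >> 24) & 0x3u;
	s.index   = (arg >> 16) & 0xffu;
	s.value   = (arg >> 8) & 0xffu;
	s.cmd_set = arg & 0x7u;
	return s;
}
```

Decoded this way, 0x03b30001 is a write-byte to EXT_CSD index 179 (PARTITION_CONFIG) with value 0, which looks like the switch from RPMB back to the user area after the failed read.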
> > > > > >
> > > > > > Just asking because it doesn't mean much to me other than the
> > > > > > obvious CRC problem.
> > > > > >
> > > > > > > That this issue is so easy to trigger - and to fix - indicates a
> > > > > > > problem on the card more than on the algorithm (otherwise
> > > > > > > faults would be all over the place). But I am not an expert in
> > > > > > > this area.
> > > > > >
> > > > > > any additional suggestions welcome.
> > > > >
> > > > > My guess is that sometimes tuning produces a "bad" result.
> > > > > Perhaps the margins are very tight and the difference is only 1
> > > > > tap.  When a "bad" result happens in non-RPMB, a CRC error
> > > > > results in re-tuning and retry, so no errors are seen.  When it
> > > > > happens in RPMB, that is not possible, so the error is obvious.
> > > > > Not re-tuning before RPMB switch helps because the
> > > > > CRC-error->re-tuning to a "good" result has probably already
> happened.
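
To make the theory above concrete, here is a toy user-space model of "pause re-tuning before the RPMB switch", with and without first requesting a re-tune. It is only loosely based on my reading of the mmc core; struct toy_tune_state and both functions are hypothetical and are not the kernel's data structures or API.

```c
#include <stdbool.h>

/* Hypothetical, simplified state; not the kernel's struct mmc_host. */
struct toy_tune_state {
	bool need_retune; /* a re-tune has been requested */
	bool retune_now;  /* run a pending re-tune before the next request */
	int  hold_retune; /* >0 while re-tuning is being held off */
};

/* Pause re-tuning ahead of an RPMB switch, optionally forcing one first. */
void toy_retune_pause(struct toy_tune_state *s, bool force_retune)
{
	if (force_retune)
		s->need_retune = true;
	if (s->hold_retune == 0)
		s->retune_now = true; /* last chance before the hold starts */
	s->hold_retune++;
}

/* Returns true if tuning would run ahead of the next request. */
bool toy_tunes_before_next_request(struct toy_tune_state *s)
{
	bool run = s->retune_now && s->need_retune;

	s->retune_now = false;
	if (run)
		s->need_retune = false;
	return run;
}
```

In this model, forcing always puts a fresh tuning pass right before the partition switch. Without forcing, tuning only runs if something earlier (for example a CRC error on the normal path) had already requested it.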
> > > > >
> > > > > > However, based on that theory, it is not necessarily the eMMC
> > > > > > that is at fault.
> > > > >
> > > > > It may be worth considering a stronger eMMC driver strength setting.
> > > >
> > > > sure I can tune the value (just building now). however I am not
> > > > sure about the implications - is there any negative consequence of
> > > > increasing this value that I could monitor (if tests pass)?
> > >
> > > ZynqMP does not set the property "fixed-emmc-driver-type" and since
> > > the sdhci-of-arasan driver does not implement
> > > select_drive_strength() the drive_strength setting is zero.
> > >
> > > So AFAICS things are working accordingly - it is hard for me to say
> > > if things should have been coded any differently.
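
A rough sketch of the selection logic described above, as I understand it: a fixed driver type from the device tree wins if the card supports it, otherwise the host callback decides, and with neither the result is 0. All names here are hypothetical stand-ins, and -1 for "property absent" is an assumption of the sketch.

```c
#include <stddef.h>

/* Hypothetical stand-in for a host driver's select_drive_strength() hook. */
typedef int (*toy_select_fn)(unsigned int card_drv_mask);

struct toy_drv_host {
	int fixed_drv_type;   /* -1 when "fixed-emmc-driver-type" is absent */
	toy_select_fn select; /* NULL when the host driver has no callback */
};

int toy_pick_drive_strength(const struct toy_drv_host *h,
			    unsigned int card_drv_mask)
{
	card_drv_mask |= 1u << 0; /* driver type 0 is always supported */

	if (h->fixed_drv_type >= 0)
		return (card_drv_mask & (1u << h->fixed_drv_type)) ?
			h->fixed_drv_type : 0;
	if (h->select == NULL)
		return 0; /* the case described above for ZynqMP */
	return h->select(card_drv_mask);
}
```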
> > >
> > > > >
> > > > > sdhci supports err_stats in debugfs - that may show how many CRC
> > > > > errors there are when not accessing RPMB.
> > > >
> > > > ok
> > > >
> > > > >
> > > > > I don't object to skipping re-tuning before RPMB switch, but I
> > > > > am not sure about tying it to a specific eMMC.
> > > >
> > > > thanks. will follow up after further testing.
> > >
> > > should I just repost the patch now skipping the retune for all cards
> > > before switching to the RPMB partition? instead of using a quirk?
> > >
> > > On this particular card it has now run for a couple of days so I am
> > > confident that it addresses at the very least the symptom of the issue.
> > As aforesaid, we observed a similar issue on a different platform as well.
> > If it's not realistic to further pursue Adrian's theory, *And* this
> > doesn't reproduce on other cards, we have no objection to setting the
> > quirk for Sandisk.
> > (if you're having trouble testing other cards ping me privately I can help you
> > with that).
> 
> I have some extra eMMC cards which I used to validate RPMB on the OP-TEE
> port I did for AMD/Xilinx Versal ACAP some time ago and which I maintain
> upstream:
> https://optee.readthedocs.io/en/latest/building/devices/versal.html?highlight=versal
> 
> However I can't use them on this hardware - there is not a uSD slot, just USB.
> And from what I can see, RPMB doesn't get mapped when the device is
> mounted as a mass storage device (unless there is a way that I don't know?).
> Other than that I am not sure what to propose. Suggestions welcome.
Maybe we can help you to replace the soldered device with a socket.
I need my managers to approve this.
Ping me privately if this is a viable option for you.

> 
> Versal uses the same sdhci-of-arasan driver but with some differences:
> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842090/SD+controller?responseToken=4bd005c7902a3dbd9ecb032f02e52ccb
> This particular issue can not be reproduced on that platform.
> 
> It also didn't ever trigger in any of the other platforms I have worked with
> and supported during the last four years (STM32MP1, NXP (iMX8, iMX7,
> iMX6), etc). And we have heard of no customers complaining about upgrade
> issues.
> 
> Since RPMB is critical for our security story - device firmware verification and
> upgrade - we would have noticed. So this is the first time we have had
> trouble accessing RPMB - incidentally blocking a product launch and causing
> a bit of pain.
> https://docs.foundries.io/latest/reference-manual/security/secure-machines.html?highlight=rpmb#accessing-secure-storage
> 
> We could carry the patch internally (it seems harmless after all the testing
> done) but I'd much rather land it upstream if possible.
Agreed.

Thanks,
Avri
> 
> >
> > Thanks a lot for fixing this,
> > Avri
> 
> thanks everyone for the support.
> 
> >
> > (btw - yes - our manufacturer id is 0x45 - it is set differently in
> > the mmc driver for historic reasons - Thank you for adding this.)
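
For reference, a minimal sketch of what keying such a quirk on the manufacturer id could look like. It assumes the MID sits in CID bits 127:120, i.e. the top byte of the first 32-bit response word. The macro names and the quirk bit value are hypothetical; only the 0x45 value comes from this thread.

```c
#include <stdint.h>

#define TOY_MANFID_SANDISK_EMMC      0x45u      /* value stated in this thread */
#define TOY_QUIRK_BROKEN_RPMB_RETUNE (1u << 0)  /* hypothetical bit position */

/* cid[0] is assumed to hold CID bits 127:96; MID is bits 127:120. */
unsigned int toy_cid_manfid(const uint32_t cid[4])
{
	return (cid[0] >> 24) & 0xffu;
}

/* Hypothetical lookup: which quirks apply to a card with this CID. */
unsigned int toy_card_quirks(const uint32_t cid[4])
{
	if (toy_cid_manfid(cid) == TOY_MANFID_SANDISK_EMMC)
		return TOY_QUIRK_BROKEN_RPMB_RETUNE;
	return 0;
}
```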
> >