> > On 02/12/23 16:47:23, Avri Altman wrote: > > Hi, > > Sorry for joining so late - This thread was routed to my junk mail somehow. > > We were observing this issue recently with one of our clients using a > Broadcom platform. > > Similarly like in this case, the tuning process didn't use cmd21, so sending > only cmd6 is perfectly ok. > > We couldn't find any issue with the device at the time. > > During our investigation, it turned out that the client had a kernel > > hack of its own, and once it was removed the issue wasn't reproducing > anymore. > > um, do you know what driver or setting could have be caused the issue? > > This product is using the Xilinx kernel. 5.15.64 > https://github.com/Xilinx/linux-xlnx It was a Broadcom platform - BCX972127OTT, using sdhci-brcmstb, with a 5.4 kernel. > > > > > > > > >>> hi Adrian > > > > > >>> > > > > > >>> Sure, this is the output of the trace: > > > > > >>> > > > > > >>> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020 [ > > > > > >>> 422.018789] > > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.018817] mmc0: sdhci: > > > > > >>> IRQ status 0x00000020 [ 422.018848] mmc0: sdhci: IRQ status > > > > > >>> 0x00000020 [ 422.018875] mmc0: sdhci: IRQ status 0x00000020 > > > > > >>> [ 422.018902] mmc0: sdhci: IRQ status 0x00000020 [ > > > > > >>> 422.018932] > > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.020013] mmc0: sdhci: > > > > > >>> IRQ status 0x00000001 [ 422.020027] mmc0: sdhci: IRQ status > > > > > >>> 0x00000002 [ 422.020034] mmc0: req done (CMD6): 0: 00000800 > > > > > >>> 00000000 00000000 00000000 [ 422.020054] mmc0: starting > > > > > >>> CMD13 arg 00010000 flags 00000195 [ 422.020068] mmc0: > > > > > >>> sdhci: IRQ status 0x00000001 [ 422.020076] mmc0: req done > (CMD13): 0: > > > > > >>> 00000900 00000000 00000000 00000000 [ 422.020092] <mmc0: > > > > > >>> starting CMD23 arg 00000001 flags 00000015> [ 422.020101] > mmc0: > > > > > >>> starting CMD25 arg 00000000 flags 00000035 > > > > > >>> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 > ms > > > nsac 0 > > > > > >>> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001 [ > > > > > >>> 422.021671] > > > > > >>> mmc0: sdhci: IRQ status 0x00000002 [ 422.021691] mmc0: req > > > > > >>> done > > > > > >>> <CMD23>: 0: 00000000 00000000 00000000 00000000 [ > > > > > >>> 422.021700] > > > > > >>> mmc0: req done (CMD25): 0: 00000900 00000000 00000000 > > > 00000000 > > > > > >>> [ 422.021708] mmc0: 512 bytes transferred: 0 > > > > > >>> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags > > > > > >>> 00000195 [ 422.021743] mmc0: sdhci: IRQ status 0x00000001 [ > > > > > >>> 422.021752] > > > > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000 > > > 00000000 [ > > > > > >>> 422.021771] <mmc0: starting CMD23 arg 00000001 flags > > > > > >>> 00000015> [ 422.021779] mmc0: starting CMD18 arg 00000000 > flags 00000035 > > > > > >>> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 > ms > > > nsac 0 > > > > > >>> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001 [ > > > > > >>> 422.022566] > > > > > >>> mmc0: sdhci: IRQ status 0x00208000 > > > > > >>> <---------------------------------- this doesnt seem right [ > > Why not? > > Its cmd25-cmd25-cmd18 which implies rpmb write? > > sorry I am referring to the IRQ status 0x00208000 (CRC errors) - not the > sequence. > > > > > > > > >>> 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 > > > 00000000 00000000 [ 422.022639] mmc0: req done (CMD18): 0: > 00000900 > > > 00000000 00000000 00000000 > > > > > >>> [ 422.022647] mmc0: 0 bytes transferred: -84 < --------------------- > ---- > > > -------- it should have transfered 4096 bytes > > > > > >>> [ 422.022669] sdhci-arasan ff160000.mmc: > __mmc_blk_ioctl_cmd: > > > > > >>> data error -84 [ 422.029619] mmc0: starting CMD6 arg > > > > > >>> 03b30001 flags 0000049d [ 422.029636] mmc0: sdhci: IRQ > > > > > >>> status 0x00000001 [ 422.029652] mmc0: sdhci: IRQ status > > > > > >>> 0x00000002 [ 422.029660] > > > > > >>> mmc0: req done (CMD6): 0: 00000800 00000000 00000000 > > > > > >>> 00000000 > > > [ > > > > > >>> 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195 > > > > > >>> [ 422.029693] mmc0: sdhci: IRQ status 0x00000001 [ > > > > > >>> 422.029702] > > > > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000 > > > 00000000 [ > > > > > >>> 422.196996] <mmc0: starting CMD23 arg 00000400 flags > > > > > >>> 00000015> [ 422.197051] mmc0: starting CMD25 arg 058160e0 > flags 000000b5 > > > > > >>> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac > 400 > > > ms nsac 0 > > > > > >>> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d > > > > > >>> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020 [ > > > > > >>> 422.199526] > > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.199585] mmc0: sdhci: > > > > > >>> IRQ status 0x00000020 [ 422.199641] mmc0: sdhci: IRQ status > > > > > >>> 0x00000020 [ 422.199695] mmc0: sdhci: IRQ status 0x00000020 > > > > > >>> [ 422.199753] mmc0: sdhci: IRQ status 0x00000020 [ > > > > > >>> 422.199811] > > > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.199865] mmc0: sdhci: > > > > > >>> IRQ status 0x00000020 [ 422.199919] mmc0: sdhci: IRQ status > > > > > >>> 0x00000020 [ 422.199972] mmc0: sdhci: IRQ status 0x00000020 > > > > > >>> [ 422.200026] mmc0: sdhci: IRQ status 0x00000020 > > > > > >>> > > > > > >>> > > > > > >>> does this help? > > > > > > > > > > > > Just asking because it doesn't mean much to me other than the > > > > > > obvious CRC problem. > > > > > > > > > > > > Being this issue so easy to trigger - and to fix - indicates a > > > > > > problem on the card more than on the algorithm (otherwise > > > > > > faults would be all over the place). But I am not an expert on this > area. > > > > > > > > > > > > any additional suggestions welcome. > > > > > > > > > > My guess is that sometimes tuning produces a "bad" result. > > > > > Perhaps the margins are very tight and the difference is only 1 > > > > > tap. When a "bad" result happens in non-RPMB, a CRC error > > > > > results in re-tuning and retry, so no errors are seen. When it > > > > > happens in RPMB, that is not possible, so the error is obvious. > > > > > Not re-tuning before RPMB switch helps because the > > > > > CRC-error->re-tuning to a "good" result has probably already > happened. > > > > > > > > > > However, based on that theory, it is not necessary the eMMC > > > > > that is at fault. > > > > > > > > > > It may be worth considering a stronger eMMC driver strength setting. > > > > > > > > sure I can tune the value (just building now). however I am not > > > > sure about the implications - is there any negative consequence of > > > > increasing this value that I could monitor (if tests pass)? > > > > > > ZynqMP does not set the property "fixed-emmc-driver-type" and since > > > the sdhci-of-arasan driver does not implement > > > select_drive_strength() the drive_strength setting is zero. > > > > > > So AFAICS things are working accordingly - it is hard for me to say > > > if things should have been coded any differently. > > > > > > > > > > > > > sdhci supports err_stats in debugfs - that may show how many CRC > > > > > errors there are when not accessing RPMB. > > > > > > > > ok > > > > > > > > > > > > > > I don't object to skipping re-tuning before RPMB switch, but I > > > > > am not sure about tying it to a specific eMMC. > > > > > > > > thanks. will follow up after further testing. > > > > > > should I just repost the patch now skiping the retune for all cards > > > before switching to the RPMB partition? instead of using a quirk? > > > > > > On this particular card it has now run for a couple of days so I am > > > confident that it addresses at the very least the symptom of the issue. > > As aforesaid, we observed a similar issue on a different platform as well. > > If it's not realistic to further pursue Adrian's theory, *And* this > > doesn't reproduce on other cards, we have no objection setting the quirk > for Sandisk. > > (if you're having trouble testing other cards ping me privately I can help you > with that). > > I have some extra eMMC cards which I used to validate RPMB on the OP-TEE > port I did for AMD/Xilinx Versal ACAP some time ago and which I maintain > upstream: > https://optee.readthedocs.io/en/latest/building/devices/versal.html?highlig > ht=versal > > However I cant use them on this hardware - there is not a uSD slot, just USB. > And from what I can see, RPMB doesnt get mapped when the device is > mounted as a mass storage device (unless there is a way that I dont know?). > Other than that I am not sure what to propose. Suggestions welcome. Maybe we can help you to replace the soldered device with a socket. I need my managers to approve this. Ping me privately if this is a viable option for you. > > Versal uses the same sdhci-of-arasan driver but with some diffences: > https://xilinx- > wiki.atlassian.net/wiki/spaces/A/pages/18842090/SD+controller?responseTo > ken=4bd005c7902a3dbd9ecb032f02e52ccb > This particular issue can not be reproduced on that platform. > > It also didn't ever trigger in any of the other platforms I have worked with > and supported during the last four years (STM32MP1, NXP (iMX8, iMX7, > iMX6), etc). And we have heard of no customers complaining about upgrade > issues. > > Being RPMB critical for our security story - device firmware verification and > upgrade - we would have noticed. So this one is the first time we have had > troubles accessing RPMB - incidentally blocking a product launch and causing > a bit of pain. > https://docs.foundries.io/latest/reference-manual/security/secure- > machines.html?highlight=rpmb#accessing-secure-storage > > We could carry the patch internally (it seems harmless after all the testing > done) but I'd much rather land it upstream if possible. Agreed. Thanks, Avri > > > > > Thanks a lot for fixing this, > > Avri > > thanks everyone for the support. > > > > > (btw - yes - our manufacturer id is 0x45 - it is set differently in > > the mmc driver for historic reasons - Thank you for adding this.) > > > > > > > > > > > > > > > > > > > > > >