Re: sdhci-omap: additional PM issue since 5.16

Romain Naour <romain.naour@xxxxxxxx> · Mon, 27 Jan 2025 10:06:05 +0100

Hello David, All,

On 24/01/2025 at 19:49, David wrote:
> 
> On 1/24/25 11:15, David Owens wrote:
>> Hi Romain
>>
>> On 1/24/25 04:36, Romain Naour wrote:
>>> Hello David,
>>>
>>> On 23/01/2025 at 23:09, David Owens wrote:
>>>> Hello,
>>>>
>>>> I have an AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8 V.  Reads from /dev/mmcblk1boot0 will return expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, the read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.
>>>>
>>>> Much like a previous report in the linux-omap mailing list [1], I too was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the same I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data; it's just from the wrong device.
>>> Interesting, can you share a test script to reproduce your issue?
>> Here is a test script I've been running on my devices.  A failure is typically
>> detected after a minute or two.  I include the eMMC part type in the output as
>> we've used a couple different parts in production, all claiming to be compatible
>> and I'm starting to wonder if the failure is a combination of the aggressive
>> PM _and_ specific emmc parts.  The offset used in hexdump was just a place in
>> both mmcblk1 and mmcblk1boot0 that was non-zero.  The issue happens using any
>> offset.
>>
>> #!/bin/bash
>>
>> echo "Kernel:    $(uname -r)"
>> echo "eMMC part: $(dmesg | grep 'mmcblk1: mmc1:0001' | awk '{print $5}')"
>> BLK1=$(hexdump -C /dev/mmcblk1 -s 0x3fc000 -n 10 | head -n 1)
>> BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
>>
>> echo "/dev/mmcblk1:      ${BLK1}"
>> echo "/dev/mmcblk1boot0: ${BOOT}"
>>
>> while [[ "$BLK1" != "$BOOT" ]]; do
>>     sleep 20
>>     BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
>>     echo "/dev/mmcblk1boot0: ${BOOT}"
>> done
>>
>> echo "/dev/mmcblk1boot0 read failure"
>>
>>> Why 6.1.38? The latest 6.1.x stable release is already 6.1.127.
>>> Can you test with the latest stable release?
>> Good question.  I can certainly update to .127 but at the time we were shipping
>> units we were on .38 so that's where I've been doing all my testing.  I'll let
>> you know how running under .127 compares.
> 
> Testing with 6.1.127 shows the same behavior.

Thanks for testing.

I'm able to reproduce the issue locally (using kernel 6.1.112).
It fails after the first sleep 20...

If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
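
A way to probe the same hypothesis without rebuilding the kernel might be to
look at the card's runtime PM state through the standard sysfs power/control
and power/runtime_status attributes, and optionally pin it to "on" before
running the test script. This is only a sketch and has not been tried on this
hardware: the card path is an assumption derived from the "mmc1:0001" name in
the test script's dmesg pattern, the function names are made up for
illustration, and whether pinning the card avoids the bad reads is unknown.

```shell
#!/bin/bash
# Sketch only: inspect (and optionally pin) the runtime PM state of an MMC card.
# Assumption: the eMMC card device is mmc1:0001, as in the test script above.
CARD_DIR="${CARD_DIR:-/sys/bus/mmc/devices/mmc1:0001}"

# Print the runtime PM control setting and current status of a device directory.
# Falls back to "unknown" when an attribute cannot be read.
show_runtime_pm() {
    local dir="$1"
    local control status
    control=$(cat "${dir}/power/control" 2>/dev/null || echo "unknown")
    status=$(cat "${dir}/power/runtime_status" 2>/dev/null || echo "unknown")
    echo "control=${control} runtime_status=${status}"
}

# Writing "on" to power/control asks the PM core to keep the device active
# ("auto" hands control back to runtime PM). Needs root on a real system.
pin_runtime_pm_on() {
    local dir="$1"
    echo on > "${dir}/power/control"
}

show_runtime_pm "$CARD_DIR"
```

If runtime_status reads "suspended" during the sleep 20 and the bad read
follows the resume, that would be consistent with the card being runtime
suspended between reads. Calling pin_runtime_pm_on "$CARD_DIR" before the test
loop is one way to check whether keeping the card active changes the result.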

About the sdhci-omap driver: it's one of the few drivers that enable
MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC,
and its eMMC driver doesn't even set MMC_CAP_AGGRESSIVE_PM.

I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) with
HS200/HS400 eMMC speed modes. Indeed, MMC_CAP_AGGRESSIVE_PM was added to the
sdhci-omap driver to support SDIO WLAN device PM [1].

I've found another similar report on the BeagleBone Black (AM335x SoC) [2].

It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled for SDIO cards.

The TRM (SoC manual) says that the "Suspend-Resume Flow" is only supported for
SDIO cards:

  26.5.1.2.1.6 Suspend-Resume Flow
    The suspend-and-resume feature is supported only by SDIO cards.

Thoughts?

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442

[2]
https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1332523/beagl-bone-black-problems-reading-from-emmc-boot-partitions-on-beaglebone-with-kernel-6-1

Best regards,
Romain

> 
>>> I believe this issue could be reproduced on the beaglebone-ai board (I don't
>>> have it).
>>>
>>> [1] https://www.beagleboard.org/boards/beaglebone-ai
>> Thanks for the suggestion, I'll see if I can dig one up.
>>
>>> Best regards,
>>> Romain
>>>
>>>
>>>> [1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@xxxxxxxx/
>>>>
>>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>>>>
>>>> [3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@xxxxxxxx/T/#u
>>>>
>>>> Regards,
>>>>
>>>> Dave
>>>>
> Thanks,
> 
> Dave
>