Re: sdhci-omap: additional PM issue since 5.16

David Owens <daowens01@xxxxxxxxx> · Fri, 24 Jan 2025 11:15:10 -0600

Hi Romain

On 1/24/25 04:36, Romain Naour wrote:
> Hello David,
>
> Le 23/01/2025 à 23:09, David Owens a écrit :
>> Hello,
>>
>> I have an AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8 V.  Reads from /dev/mmcblk1boot0 return the expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, the read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.
>>
>> Much like a previous report on the linux-omap mailing list [1], I too was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the same I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data; it's just from the wrong device.
> Interesting, can you share a test script to reproduce your issue?

Here is a test script I've been running on my devices.  It loops until a read
of mmcblk1boot0 returns the data stored in mmcblk1; a failure is typically
detected after a minute or two.  I include the eMMC part type in the output
because we've used a couple of different parts in production, all claiming to
be compatible, and I'm starting to wonder whether the failure is a combination
of the aggressive PM _and_ specific eMMC parts.  The offset used in hexdump was
just a place where both mmcblk1 and mmcblk1boot0 hold non-zero data.  The issue
happens with any offset.

#!/bin/bash

echo "Kernel:    $(uname -r)"
echo "eMMC part: $(dmesg | grep 'mmcblk1: mmc1:0001' | awk '{print $5}')"
# Reference reads: 10 bytes at the same offset in the user area and in boot0
BLK1=$(hexdump -C /dev/mmcblk1 -s 0x3fc000 -n 10 | head -n 1)
BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)

echo "/dev/mmcblk1:      ${BLK1}"
echo "/dev/mmcblk1boot0: ${BOOT}"

# Keep re-reading boot0 after an idle delay; the loop ends once the boot0
# read returns the mmcblk1 data, which is the failure case
while [[ "$BLK1" != "$BOOT" ]]; do
    sleep 20
    BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
    echo "/dev/mmcblk1boot0: ${BOOT}"
done

echo "/dev/mmcblk1boot0 read failure"
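
If it helps to put a number on the failure rate, the same comparison can be
wrapped in a function that counts wrong reads instead of stopping at the first
one.  This is only a sketch: the function names and arguments are made up for
illustration, and it uses dd and od rather than hexdump so the offset and
length are explicit.

```shell
#!/bin/bash
# Sketch only: function names, arguments and defaults are illustrative.

# read_sample DEV OFFSET
# Print 10 bytes at byte OFFSET (decimal or 0x-prefixed hex) of DEV as hex.
read_sample() {
    dd if="$1" bs=1 skip="$(( $2 ))" count=10 2>/dev/null | od -A n -t x1
}

# count_wrong_reads USER_DEV BOOT_DEV OFFSET ITERATIONS DELAY
# Print how many of ITERATIONS delayed reads of BOOT_DEV returned the bytes
# stored at the same offset in USER_DEV.
count_wrong_reads() {
    local user_dev="$1" boot_dev="$2" offset="$3" iterations="$4" delay="$5"
    local user boot wrong=0 i=0

    user=$(read_sample "$user_dev" "$offset")
    while [ "$i" -lt "$iterations" ]; do
        sleep "$delay"
        boot=$(read_sample "$boot_dev" "$offset")
        # A wrong read is a non-empty boot sample equal to the user-area data
        if [ -n "$boot" ] && [ "$boot" = "$user" ]; then
            wrong=$(( wrong + 1 ))
        fi
        i=$(( i + 1 ))
    done
    echo "$wrong"
}
```

For example, "count_wrong_reads /dev/mmcblk1 /dev/mmcblk1boot0 0x3fc000 30 20"
would do 30 reads spaced 20 seconds apart and print the number that came back
with mmcblk1 data.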

>
> Why 6.1.38? The 6.1.x stable series is already at 6.1.127.
> Can you test with the latest stable release?

Good question.  I can certainly update to .127 but at the time we were shipping
units we were on .38 so that's where I've been doing all my testing.  I'll let
you know how running under .127 compares.

>
> I believe this issue could be reproduced on the beaglebone-ai board (I don't
> have it).
>
> [1] https://www.beagleboard.org/boards/beaglebone-ai

Thanks for the suggestion, I'll see if I can dig one up.

>
> Best regards,
> Romain
>
>
>> [1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@xxxxxxxx/
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>>
>> [3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@xxxxxxxx/T/#u
>>
>> Regards,
>>
>> Dave
>>
>>