Need help understanding sporadic errors on OMAP35x YAFFS/NAND system

I'm seeing a sporadic problem using NAND/YAFFS on a Logic LV SOM
with a 1928-block YAFFS filesystem.

I've got the 2.6.32 kernel (L23 Poky from
http://www.omappedia.org/wiki/OMAP_Poky) up and running, and
sporadically in testing I observe an error where 0xff30 shows up in
the data read back from the file. It looks somewhat similar to:
http://www.mail-archive.com/linux-omap@xxxxxxxxxxxxxxx/msg23103.html

Testing involves using "dd if=/dev/zero of=/mnt/yaffs/<file> bs=1
seek=30M count=0" to create a 30MB file of zeros, then copying the
file around on the flash, running md5sum, syncing, etc. to thrash
the cache.
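
Before trusting a mismatch, it can help to confirm the reference
checksums independently of the flash. A minimal sketch, assuming a
host with dd and md5sum available; zero_md5 is a hypothetical helper
name, not part of the test script. It hashes zeros read straight from
/dev/zero, so the results should reproduce the md5_30M and md5_1K
constants used in the script:

```shell
#!/bin/bash

# Print the md5sum of N KiB of zeros read directly from /dev/zero,
# bypassing the NAND filesystem entirely.
zero_md5() {
    dd if=/dev/zero bs=1024 count="$1" 2>/dev/null | md5sum | cut -d" " -f1
}

md5_1K=$(zero_md5 1)        # 1 KiB of zeros
md5_30M=$(zero_md5 30720)   # 30 * 1024 KiB = 30 MiB of zeros

echo "1K:  $md5_1K"
echo "30M: $md5_30M"
```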

The error I'm seeing is that when I read the file back, its md5sum
does not match what a 30MB file of zeros should generate.
To verify, I copy the file from the NAND to a temporary file in RAM,
then md5sum that file, and if the md5sum mismatches, I hexdump
the file to see where the data mismatches. This all runs fine in my
test shell script, except after a while (somewhere around 30+GB read
from NAND), I see:

somefile.7: mismatch 666896a98683a364c10aeba0649f119c != 281ed1d5ae50e8419f9b978aab16de83
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1107800 ff30 ff30 ff30 ff30 ff30 ff30 ff30 ff30
*
1107a00 0000 0000 0000 0000 0000 0000 0000 0000
*
1e00000

instead of the zeros I'd expect.  Originally I thought the problem was
in the NAND, where somehow the driver tried to read a sector of data
before it was ready, but if this were the case, I'd expect an ECC error
from the comparison (using hardware-generated ECC, prefetch and DMA).
This is not the case (I added a printk that triggers if
omap_compare_ecc() returns non-zero).  So if no ECC error is reported,
then the data should be valid on NAND.  To test whether the data had
been written incorrectly, I unmounted the filesystem and remounted it,
but then the md5sum does match.
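
The offsets in the hexdump above are hexadecimal, so the extent of
the corruption can be worked out with shell arithmetic. This is only
a sketch of that calculation; the 2048-byte page size and 512-byte
sector size are assumptions for illustration, not values stated
anywhere in this report:

```shell
#!/bin/bash

start=$((0x1107800))   # first offset showing ff30
end=$((0x1107a00))     # first offset where zeros resume
size=$((0x1e00000))    # final offset printed by hexdump (file length)

len=$((end - start))
echo "corrupt bytes:      $len"              # 512
echo "corrupt shorts:     $((len / 2))"      # 256
echo "file length:        $size"             # 31457280, i.e. 30 MiB

# Assumed geometry (hypothetical): 2048-byte pages, 512-byte sectors.
echo "offset mod 2048:    $((start % 2048))" # 0
echo "offset mod 512:     $((start % 512))"  # 0
echo "page index in file: $((start / 2048))" # 8719
```

Under those assumptions the bad run starts exactly on a page boundary
and is exactly one 512-byte sector long.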

This is not the first I've seen of the problem.  I've seen it in a
2.6.28-rc8 kernel, and in the 2.6.32 kernel I've tried turning off
DMA and prefetch, and that hastens when the error turns up (and
increases the number of 0xff30 shorts seen).  I modified my testing to
use a unique pattern instead of zeros and found that when the 0xff30
shows up, it repeats for a number of shorts at the start of a page,
then I see the data that I expected from the page. I've also modified
the NAND driver to use a dev_ready function (as well as statistics to
track how long it waits polling the R/B# line on WAIT0, which indicate
it's 21.2uS +/- 8.29uS once the call to omap_device_ready is made),
and still no joy.
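
For anyone wanting to reproduce the unique-pattern variant, here is a
minimal sketch of one possible pattern. It is not necessarily the
pattern used in the testing described above; gen_pattern and
diff_summary are hypothetical helper names. Each 16-byte record holds
its own record index, so any stretch of the file identifies its own
offset, and cmp -l (which prints 1-based decimal byte numbers) is used
to summarize where a copy differs from the reference:

```shell
#!/bin/bash

# Write N 16-byte records: a 15-digit zero-padded record index plus a
# newline. Record i therefore starts at byte offset 16 * i.
gen_pattern() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "%015d\n", i }'
}

# Print how many bytes differ between two files and the 0-based offset
# of the first difference, or "identical" if there is none.
diff_summary() {
    cmp -l "$1" "$2" | awk '
        NR == 1 { first = $1 - 1 }
        END {
            if (NR) printf "%d bytes differ, first at offset %d\n", NR, first
            else    print "identical"
        }'
}
```

For a 30MB reference file, "gen_pattern 1966080 > reference" gives
30 * 1024 * 1024 / 16 records; "diff_summary reference copy" then
reports the size and starting offset of any corrupted region.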

I've also run this code on a 2.6.33-rc3 kernel with the same driver
set and there it works flawlessly.  Unfortunately I need the Poky
kernel...

At this point I'm at a loss to explain what is happening:

1) Has anyone seen this type of error before?

2) Are there any OMAP35x errata that could possibly explain what I'm seeing?

3) Has anyone done exhaustive testing of a NAND-based filesystem on an
OMAP35x board?

4) Any suggestions where to look next? (YAFFS testing with nandsim on
an x86 doesn't exhibit the problem).

The following is the original test script (cd into the mountpoint of
the NAND filesystem before running):

#!/bin/bash

# MD5sum of 30M and 1K of zeros
md5_30M=281ed1d5ae50e8419f9b978aab16de83
md5_1K=0f343b0931126a20f133d67c2b018a3b

# temp file to use as intermediary copy
tmpfile=/dev/tmp/junk
#tmpfile=/tmp/junk
mkdir -p `dirname $tmpfile`

mismatches=0
pass=0
passes=120
if [ "$1" != "" ]; then
    passes="$1"
fi


# $1 is file (or a quoted glob pattern that expands to several files)
# $2 is good checksum
chk_md5sum() {
    # $1 is deliberately unquoted so that a pattern like "somefile.*" expands
    for cmf in $1
    do
        cp "$cmf" "$tmpfile"
        ret=`md5sum "$tmpfile" | cut -d" " -f1`
        if [ "$ret" != "$2" ]; then
            echo "$cmf: mismatch $ret != $2"
            hexdump < "$tmpfile" | head -100
            mismatches=`expr $mismatches + 1`
        else
            echo "$cmf: match $ret"
        fi
    done
}

# $1 is src
# $2 is destination
# $3 is expected md5sum of source
chk_cp() {
    cp "$1" "$tmpfile"
    cp "$tmpfile" "$2"
    ret=`md5sum "$tmpfile" | cut -d" " -f1`
    if [ "$ret" != "$3" ]; then
        echo "$1: mismatch $ret != $3"
        hexdump < "$tmpfile" | head -100
        mismatches=`expr $mismatches + 1`
    fi
}

while [ $pass -lt $passes ]; do
    pass=`expr $pass + 1`
    echo "Pass: $pass Errors: $mismatches"
    date

# create a 30 M file
    echo "Create 30M file of zeros and get md5sum"
    dd if=/dev/zero of=somefile.1 bs=1 seek=30M count=0

    chk_md5sum somefile.1 $md5_30M

# create copies of file
    for f in 2 3 4 ;
    do
	cp somefile.1 somefile.$f
    done

    echo "Calculate md5sums for copied files"
    chk_md5sum "somefile.*" $md5_30M
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    echo "execute sync and recalculate md5sums"
    sync
    chk_md5sum "somefile.*" $md5_30M
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    echo "Delete one of the files"
    rm somefile.2

    echo "recopy the deleted file"
    cp somefile.1 somefile.7
    chk_md5sum "somefile.*" $md5_30M
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    echo "Creating test folder and some junk files in that folder"
    mkdir -p test

    cd test

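    # count=0 copies no input; seek=1k only extends the file to 1K of zeros,
    # which is why its checksum is compared against md5_1K
    dd if=/dev/zero of=junk.1 bs=1 count=0 seek=1k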

    chk_md5sum junk.1 $md5_1K
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    for f in 2 3 4 5 6 7 8 9;
    do
	chk_cp junk.1 junk.$f $md5_1K
	if [ "$mismatches" != "0" ]; then
	    break;
	fi
    done

    echo "md5sums of all files in test folder"
    chk_md5sum "junk.*" $md5_1K
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    echo "execute sync and recalculate md5sums"
    sync

    chk_md5sum "junk.*" $md5_1K
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    echo "Remove some files and recreate them"
    for f in  3 5 8;
    do
	rm junk.$f
    done

    for f in  8 3 5;
    do
	chk_cp junk.1 junk.$f $md5_1K
	if [ "$mismatches" != "0" ]; then
	    break;
	fi
    done

    cd ..

    echo "Calculate md5sums for 30M files again"
    chk_md5sum "somefile.*" $md5_30M
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    echo "execute sync and recalculate md5sums"
    sync
    chk_md5sum "somefile.*" $md5_30M
    if [ "$mismatches" != "0" ]; then
	break;
    fi

    if [ -f /proc/yaffs ]; then
	cat /proc/yaffs
    fi
    if [ -f /proc/nand-wait-stats ]; then
	cat /proc/nand-wait-stats
    fi
done