Buffer-cache corruption with SMP + PIO IDE

Hello,

I'm looking for some guidance in tracking down a strange data
corruption that occurs when all of the following conditions hold:

	- IDE (PIO mode)
	- SMP system
	- Buffer cache has grown to use as much memory as it can
	- No swap file used

My test showing the corruption is a stupid script that extracts a
large .tar.bz2 file to the drive's filesystem, then verifies MD5
checksums of each extracted file against a list of known checksum
values.  It then deletes the extracted files, and repeats the whole
process forever.
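
A minimal sketch of such a test loop, for anyone who wants to try to
reproduce this. It is not the original script: the function names, the
md5sum-style format of the checksum list, and all paths are assumptions
made for illustration.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_checksums(list_path):
    """Parse md5sum-style lines: "<hex digest>  <relative path>" (assumed format)."""
    expected = {}
    with open(list_path, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, name = line.split(None, 1)
            expected[name.lstrip("*")] = digest.lower()
    return expected


def run_pass(archive, dest, expected):
    """Extract the archive, verify every listed file, delete the tree.

    Returns the sorted names of files whose MD5 does not match.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest)
    bad = [name for name, digest in sorted(expected.items())
           if md5_of(dest / name) != digest]
    shutil.rmtree(dest)
    return bad


def run_forever(archive, dest, list_path):
    """Repeat extract/verify/delete until interrupted, logging each pass."""
    expected = load_checksums(list_path)
    n = 0
    while True:
        n += 1
        bad = run_pass(archive, dest, expected)
        print(f"pass {n}: {len(bad)} of {len(expected)} files corrupt", flush=True)
        for name in bad:
            print(f"  MISMATCH {name}", flush=True)
```

Because the files are read back right after being written, the
verification step is most likely served from the buffer cache rather
than from the drive. To inspect a bad file, the delete step would have
to be skipped for that pass.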

When the corruption happens, the MD5 sums for maybe 5-20 of the 1000+
files will be wrong.  In each corrupted file, 2 bytes are missing in a
few places somewhere in the middle, and then after a chunk of valid data
there are two bogus bytes.  In all files, the bogus bytes always seem to
be "0xd0 0xd0".  It looks like the data was actually written to disk
fine, and it is the copy in the buffer cache that is wrong.
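
One way to pin down where the damage sits is to compare a corrupted
file against a known-good copy. The sketch below assumes one reading of
the pattern described above: two bytes are dropped, the following data
is shifted down by two, and the stream realigns after a "0xd0 0xd0"
filler. The function name and the returned tuple layout are
hypothetical, and the reading itself may not match the real layout.

```python
def find_shift_regions(good, bad, shift=2, filler=b"\xd0\xd0"):
    """Find spans where `bad` looks like `good` with `shift` bytes dropped.

    Returns a list of (start, end, ends_with_filler) tuples. `start` is the
    first differing offset, `end` is where the shifted run stops, and
    `ends_with_filler` tells whether `filler` sits at `end` in `bad`.
    A span with start == end is a mismatch that does not fit the pattern.
    """
    regions = []
    n = min(len(good), len(bad))
    i = 0
    while i < n:
        if good[i] == bad[i]:
            i += 1
            continue
        start = i
        j = start
        # Walk forward while bad matches good read `shift` bytes ahead.
        while j + shift < n and bad[j] == good[j + shift]:
            j += 1
        ends_with_filler = bad[j:j + len(filler)] == filler
        regions.append((start, j, ends_with_filler))
        # Resume after the filler if present; otherwise always make progress.
        i = j + len(filler) if ends_with_filler else max(j, start + 1)
    return regions
```

If the reported start offsets cluster on sector or page boundaries,
that would help narrow down which layer is dropping the bytes. Note that
`start` can land slightly after the true drop point when neighbouring
bytes happen to be equal.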

Platform: An embedded single-board computer with dual MPC7448 processors

The problem exists both while using a PMC hard drive (controller
accessed over PCI bus) and an IDE controller on the board's FPGA wired
to a CompactFlash slot.  I verified that the corruption happened with
both ext3 and ReiserFS.

The problem does NOT occur if I use DMA mode with the PMC hard drive, or
on a uniprocessor kernel.  Maybe nobody has stumbled upon this because the
combination of a multiprocessor system with PIO IDE is so uncommon?

Also, and I think this is key: The test runs fine until the buffer
cache appears to occupy as much memory as it can.  Consistent with that,
if I run a simple program that mallocs 300 MB in the background while
the test runs, it fails quite soon.  I initially chalked it up to
expected behavior for a low-memory situation, but the same test setup
runs fine on a uniprocessor kernel.
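
A rough stand-in for that background program, sketched in the same
language as the other examples rather than C. The function names and
the 4096-byte page size are assumptions, and whether the original
program touched the memory it allocated is not known; this version
writes one byte per page so the pages are actually resident.

```python
import time


def hold_memory(megabytes=300, page_size=4096):
    """Allocate `megabytes` MB and write to every page so it stays resident."""
    block = bytearray(megabytes * 1024 * 1024)
    for offset in range(0, len(block), page_size):
        block[offset] = 1
    return block


def hold_forever(megabytes=300):
    """Keep the allocation alive until the process is killed."""
    block = hold_memory(megabytes)
    while True:
        time.sleep(60)
    return block  # never reached; keeps the reference visibly alive
```

Calling hold_forever() in a separate process before starting the test
shrinks the memory left for the buffer cache. With no swap configured,
those pages cannot be paged out, so the cache hits its ceiling much
sooner.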

I'm using kernel 2.6.16, but I'm fairly sure this problem also happened
in 2.6.11.

Any ideas?  I've tried things like disabling the L2 cache on both CPUs,
enforcing HW cache coherency, and adding extra spinlocks in places,
all to no avail.

Thanks,

- Nate Case <ncase@xxxxxxxxxxx>
