On Thu, 20 May 2010 11:42:29 +0200
Tejun Heo <tj@xxxxxxxxxx> wrote:

> > randomly flipped bits? I don't know if you saw the first couple of
> > mails (before linux-ide was added), but the problem is data being moved
> > around, not just randomly changed.
>
> I only saw your previous posting. TLP corruption can happen during
> command setup phase and bit flipping in the command address part is
> definitely possible, so reads and writes can be headed at wrong places
> in both memory and disk. I don't know whether this would fit your
> symptom tho.
>

Ah. Here's the problem description from a previous mail:

<recap>
The corruption is 104 bytes. Somewhat odd number. I would have expected
something more fundamental like a sector or a page.

The data in question seems to come from another part of the file. The
shifts are 015d1380 => 015d0f80 (-1024 bytes) and 02210380 => 0220ff80
(also -1024 bytes). At least the offset is a nice, sane power of two
number. Noteworthy is also that the last three nibbles of the
corruption are always the same (xxxxx380 => xxxxxf80).
</recap>

Note that the above analysis is from files, so it involves the entire
stack. I've since focused on raw disks. See below.

> > Another note is that the problem seems to worsen under load. I'm
> > running the dd thing in the background, which seems to make read errors
> > more common on my test files on the filesystem level.
>
> It would be great if you can try a different controller in similar
> setup.

I only stock sil3132 cards as those are the only decent add-on cards
I've found. AHCI stuff all seems to be onboard.

> But please keep trying to narrow down the problem and if
> possible please remove filesystem from the stack and test against the
> block device directly.

That's what I've been doing the last couple of runs. From a previous
mail:

<recap2>
I did some more testing though, and this might be a low level issue.
I did the following multiple times:

# dd if=/dev/sde skip=4k bs=4M count=500 | md5sum

And the results were:

13aa29adcd16f8d0faf3cb5c39f43826
d1e3df33c0b0d03c61f880a8f2bb6cfb
13aa29adcd16f8d0faf3cb5c39f43826
13aa29adcd16f8d0faf3cb5c39f43826
13aa29adcd16f8d0faf3cb5c39f43826
13aa29adcd16f8d0faf3cb5c39f43826
7a746328b60a63b76847c3e1319a8534
13aa29adcd16f8d0faf3cb5c39f43826
</recap2>

Since the amount of data is much larger here and the incidents more
rare, I haven't been able to confirm that the corruption is identical
to what I've seen in the files. I'm working on the assumption that it
is...

I've since constructed a script that keeps re-running the above over
all relevant disks and keeps track of how many unique md5 values we
get. It's been running for about 1.5 hours right now, and here are the
results so far:

sdd - 3
sde - 4
sdf - 1
sdb - 1
sdc - 1

sdd and sde are both on the same controller, so the problem you
mentioned could be relevant. I'll let the test run for a few more hours
and try moving things off that controller later tonight.

Thanks for looking at this. Unstable data storage is one of those
things that can keep you up at night. :/

Rgds
--
     -- Pierre Ossman

  WARNING: This correspondence is being monitored by FRA, a
  Swedish intelligence agency. Make sure your server uses
  encryption for SMTP traffic and consider using PGP for
  end-to-end encryption.
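
P.S. For anyone wanting to reproduce the test, the re-run loop described
above could look roughly like the following. This is only a sketch and
not the actual script: the function name unique_md5_count and its
arguments are made up for illustration, and it runs a fixed number of
passes instead of looping until interrupted.

```shell
#!/bin/sh
# Hypothetical helper (not the original script): read the same region
# of a device RUNS times with dd and print how many distinct md5
# digests were seen. A healthy disk should always give 1.
# Usage: unique_md5_count DEVICE RUNS [SKIP] [BS] [COUNT]
unique_md5_count() {
    dev=$1
    runs=$2
    skip=${3:-4k}     # defaults mirror the dd command quoted above
    bs=${4:-4M}
    count=${5:-500}
    i=0
    while [ "$i" -lt "$runs" ]; do
        # One pass: read the region and emit only the hex digest.
        dd if="$dev" skip="$skip" bs="$bs" count="$count" 2>/dev/null \
            | md5sum | cut -d' ' -f1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}

# Example (assumed device names, needs read access to the disks):
#   for d in sdb sdc sdd sde sdf; do
#       echo "$d - $(unique_md5_count "/dev/$d" 10)"
#   done
```

Any value above 1 for a disk means at least one pass returned different
data for the same region.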