Data corruption when using INIC-1623TA2 controller

Ben Hutchings <ben@xxxxxxxxxxxxxxx> · Sun, 30 Jun 2013 15:49:24 +0100

Martin Braure de Calignon wrote:
> I'm experiencing frequent data corruption on my raid1 ext4 fs.
> The error is not always the same.
> 
> I first thought it was due to a previous resize of the FS I'd done.
> Several times I got messages about a huge number of multiply-claimed blocks in inode xxxx.
> fsck.ext4 never completed fully and always ended with a message like "FS still have error".
> I was unable to copy all the files to another FS to save them.
> So I ended up checking for bad blocks (that's where I was dumb: I chose a non-data-preserving mode).
> However, badblocks completed successfully without errors.
> So in the end I lost some data, but I don't know if it's due to the bug or to the badblocks check. So feel free to readjust severity.
> 
> Since then, I have bought two brand new disks, created a completely new ext4 FS, and copied over the files that I had successfully recovered.
> Then I ran fsck.ext4 on the FS... it seems to be almost working.
> I remounted /dev/md0... And each time I start using the system seriously, I get new errors, like the one I had today:
> (I was just copying files on it)
[...]
> [1436849.120036] EXT4-fs (md0): error count: 6
> [1436849.120044] EXT4-fs (md0): initial error at 1371763084: htree_dirblock_to_tree:587: inode 20971803: block 83894316
> [1436849.120054] EXT4-fs (md0): last error at 1371765809: htree_dirblock_to_tree:587: inode 41813096: block 167256110
> [1446656.923648] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode #52698372: block 210773049: comm smbd: bad entry in directory: directory entry across blocks - offset=1052(9244), inode=1949184565, rec_len=29816, name_len=24
[...]

The kernel log also showed the CPU was reaching its temperature limit,
but after he cleaned out the CPU cooler and corrected the CPU frequency
the problem persisted.  I suggested swapping disks between controllers:

On Fri, 2013-06-28 at 15:55 +0200, Martin Braure de Calignon wrote:
[...]
> So as planned I unplugged the working non-RAID1 disks from their
> controller, and connected the ext4 RAID1 and the ext3 RAID1 disks to it
> (yeah these are 2 powerful RAID1 with 1 device only ;) for testing
> purposes).
> I also tried re-seating each PCI card, and connected the video card fan
> that was not connected (yeah it was a bad idea to limit the noise level
> a few years ago).
> 
> I did all the tests I could to try to overheat the system (same as
> yesterday):
> * 4 running dd if=/dev/urandom | gzip >/dev/null for the cpu
> * massive copy from one disk to the other 
> * delete of duplicates between two directories (with many duplicates)
> 
> All that in parallel. Everything seems to work fine. No corruption and no
> CPU overheating messages (yesterday I still had some even after removing
> the overclock of the CPU).
[...]
> Here's the lspci -vvvv for this card (if I'm not wrong):
> 
> 02:09.0 SATA controller: Initio Corporation INI-1623 PCI SATA-II Controller (rev 02) (prog-if 00 [Vendor specific])
>         Subsystem: Initio Corporation Device 1626
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 32, Cache Line Size: 32 bytes
>         Interrupt: pin A routed to IRQ 17
>         Region 0: I/O ports at 9000 [size=256]
>         Region 1: Memory at ef022000 (32-bit, non-prefetchable) [size=4K]
>         [virtual] Expansion ROM at 80000000 [disabled] [size=128K]
>         Capabilities: [dc] Power Management version 2
>                 Flags: PMEClk+ DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>         Kernel driver in use: sata_inic162x
[...]

So this does seem to be a fault in either this card or the driver.  Can
you suggest any further tests that Martin could do?
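
One further test that might help separate card/driver faults from
filesystem problems is a plain write/read-back check on a disk attached
to the suspect controller. The following is only a minimal sketch, not
something that was run in this report: the function names, the seed and
the file path are all hypothetical, and it assumes Python 3.6+ on Linux.
It writes deterministic pseudo-random chunks, records a SHA-256 digest
per chunk, asks the kernel to drop the cached pages, and then re-reads
the file and reports which chunks no longer match.

```python
import hashlib
import os
import sys


def make_chunk(seed, index, size):
    """Return `size` deterministic pseudo-random bytes for chunk `index`."""
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(size)


def write_pattern(path, chunks, chunk_size=1 << 20, seed=b"readback-test"):
    """Write `chunks` chunks to `path`; return the SHA-256 digest of each."""
    digests = []
    with open(path, "wb") as f:
        for i in range(chunks):
            data = make_chunk(seed, i, chunk_size)
            f.write(data)
            digests.append(hashlib.sha256(data).hexdigest())
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device
        if hasattr(os, "posix_fadvise"):
            # Advisory only: ask the kernel to drop this file's cached
            # pages so the read-back is more likely to hit the disk.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return digests


def verify_pattern(path, digests, chunk_size=1 << 20):
    """Re-read `path`; return the indices of chunks whose digest differs."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            data = f.read(chunk_size)
            if hashlib.sha256(data).hexdigest() != expected:
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Hypothetical usage: readback.py /mnt/suspect/testfile 1024
    target = sys.argv[1]
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 256
    recorded = write_pattern(target, count)
    mismatches = verify_pattern(target, recorded)
    print("corrupt chunks:", mismatches if mismatches else "none")
```

Because posix_fadvise() is only a hint, some reads may still be served
from the page cache; using a file larger than RAM, or remounting between
the write and the verify, would make the result more trustworthy.
Running the same test on the known-good controller gives a baseline to
compare against.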

Ben.

-- 
Ben Hutchings
Sturgeon's Law: Ninety percent of everything is crap.