Martin Braure de Calignon wrote:
> I'm experiencing frequent data corruption on my raid1 ext4 fs.
> The error is not always the same.
>
> I first thought it was due to a previous resize of the FS I've done.
> Several times I got messages about a huge number of multiply claimed
> blocks in inode xxxx.
> fsck.ext4 was not working fully and was always ending with a message
> like "FS still have error".
> I was unable to copy all the files to another FS to save it.
> So I ended up checking for bad blocks (that's where I've been dumb, I
> chose a non data conservative way).
> However, badblocks was successful without errors.
> So in the end I lost some data, however, I don't know if it's due to
> the bug or the badblocks check. So feel free to readjust severity.
>
> Since then, I have bought two brand new disks, created a completely new
> ext4 FS, and copied the files that I had successfully recovered.
> Then I ran fsck.ext4 on the FS... it seems it is almost working.
> I'm remounting the /dev/md0... And each time I start using the system
> seriously, I have new errors, like the one I had today:
> (I was just copying files on it)
[...]
> [1436849.120036] EXT4-fs (md0): error count: 6
> [1436849.120044] EXT4-fs (md0): initial error at 1371763084: htree_dirblock_to_tree:587: inode 20971803: block 83894316
> [1436849.120054] EXT4-fs (md0): last error at 1371765809: htree_dirblock_to_tree:587: inode 41813096: block 167256110
> [1446656.923648] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode #52698372: block 210773049: comm smbd: bad entry in directory: directory entry across blocks - offset=1052(9244), inode=1949184565, rec_len=29816, name_len=24
[...]

The kernel log also showed the CPU was reaching its temperature limit,
but after he cleaned out the CPU cooler and corrected the CPU frequency
the problem persisted. I suggested swapping disks between controllers:

On Fri, 2013-06-28 at 15:55 +0200, Martin Braure de Calignon wrote:
[...]
> So as planned I unplugged the working non RAID1 disk from their
> controller, and connected the ext4 RAID1 and the ext3 RAID1 disk to it
> (yeah these are 2 powerful RAID1 with 1 device only ;) for testing
> purposes).
> I also tried to re-plug each PCI card, and connected the video card fan
> that was not connected (yeah it was a bad idea to limit the noise level
> a few years ago).
>
> I did all the tests I could to try to overheat the system (same as
> yesterday):
> * 4 running dd if=/dev/urandom | gzip >/dev/null for the cpu
> * massive copy from one disk to the other
> * delete of duplicates between two directories (with many duplicates)
>
> All that in parallel. Everything seems to work fine. No corruption nor
> CPU overheating message (yesterday I still had some even after removing
> the overclock of the CPU).
[...]
> Here's the lspci -vvvv for this card (if I'm not wrong):
>
> 02:09.0 SATA controller: Initio Corporation INI-1623 PCI SATA-II Controller (rev 02) (prog-if 00 [Vendor specific])
>         Subsystem: Initio Corporation Device 1626
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 32, Cache Line Size: 32 bytes
>         Interrupt: pin A routed to IRQ 17
>         Region 0: I/O ports at 9000 [size=256]
>         Region 1: Memory at ef022000 (32-bit, non-prefetchable) [size=4K]
>         [virtual] Expansion ROM at 80000000 [disabled] [size=128K]
>         Capabilities: [dc] Power Management version 2
>                 Flags: PMEClk+ DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>         Kernel driver in use: sata_inic162x
[...]

So this does seem to be a fault in either this card or the driver. Can
you suggest any further tests that Martin could do?

Ben.

-- 
Ben Hutchings
Sturgeon's Law: Ninety percent of everything is crap.