Second Block on Partition overwritten with 0xFF

"Tomas Pospisek ML" <tpo2@xxxxxxxxxxxxx> · Mon, 03 Sep 2007 15:01:49 +0000

Hello everybody

we're running a small population of lightly embedded machines with the
following specs:

System: more or less a standard Intel box
FS: ext3 (defaults,errors=remount-ro,noatime)
HD: TRANSCEND, ATA DISK drive, Compact Flash (CF), 2000880 sectors (1024 MB) w/2KiB Cache, CHS=1985/16/63
Driver: Standard IDE Driver
            ICH4: chipset revision 2
            ICH4: not 100% native mode: will probe irqs later
               ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:pio, hdb:pio
               ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
kernel: 2.6.15.6 #1 PREEMPT Sat Mar 11 00:56:41 CET 2006 i686 GNU/Linux

ext3 was chosen in the hope of making the system more resilient to power
failures. The system runs on a UPS, but unfortunately some operators
will just pull the power plug (although they're instructed not to).

What we have now experienced multiple times is that a system runs just
fine, with absolutely no complaints in dmesg/kern.log, until it is
rebooted (shutdown -r now). At that point, *very rarely*, GRUB is no
longer able to read the boot filesystem (Error 17).

I've checked the on-disk data and discovered that 0x200-0x1c00 is
overwritten with 0xff, followed by a single 0x0f, and after that 0x00
until 0x207f.

That is, the second through the sixteenth on-disk blocks have been overwritten:

000001e0  53 59 53 4d 53 44 4f 53  20 20 20 53 59 53 7f 01  |SYSMSDOS   SYS..|
000001f0  00 41 bb 00 07 60 66 6a  00 e9 3b ff 00 00 00 00  |.A»..`fj.é;ÿ....|
00000200  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ|
*
00001c00  ff 0f 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |ÿ...............|
00001c10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00002080  ed 41 00 00 00 04 00 00  1e 39 a0 46 a6 6a dd 45  |íA.......9 F¦jÝE|
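
The offsets in the dump above can be mapped to sector numbers with a little arithmetic. The following is only a minimal sketch, assuming 512-byte sectors (the post does not state the sector size); the names `SECTOR_SIZE`, `REGIONS`, `sector_index` and `sectors_touched` are illustrative and not part of any tool mentioned here.

```python
SECTOR_SIZE = 512  # assumption: the CF card exposes 512-byte sectors


def sector_index(offset, sector_size=SECTOR_SIZE):
    """Return the zero-based index of the sector containing a byte offset."""
    return offset // sector_size


# Inclusive byte ranges as read from the hexdump above.
REGIONS = [
    ("0xff fill", 0x200, 0x1C00),
    ("single 0x0f", 0x1C01, 0x1C01),
    ("0x00 fill", 0x1C02, 0x207F),
]


def sectors_touched(regions=REGIONS):
    """Map each region name to its (first, last) zero-based sector index."""
    return {
        name: (sector_index(start), sector_index(end))
        for name, start, end in regions
    }


if __name__ == "__main__":
    for name, (first, last) in sectors_touched().items():
        print(f"{name}: sectors {first}..{last} (zero-based)")
```

Under that assumption, offset 0x200 is the start of zero-based sector 1, and the last zeroed byte at 0x207f lies in zero-based sector 16.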

Our project performs no hardware-level operations; all access goes
through regular file operations only. Thus we are not aware of any way
in which our software could be changing on-disk blocks directly.

What's striking about the problem above is that the first affected block
(0x200) starts _before_ the on-disk filesystem, which starts at 0x400.

My question is: does the ext3 driver _ever_ write outside of its own
space on disk, i.e. into 0x000-0x400? That is, can we rule out with
certainty that the ext3 driver is causing the problem?
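
For anyone wanting to check a saved copy of the first few KiB of an affected partition, here is a hedged sketch. It relies on two facts about the ext2/ext3 on-disk format: the primary superblock begins 1024 bytes (0x400) into the partition, and its 16-bit little-endian magic field, 0xEF53, sits 56 bytes into the superblock. The helper names `has_ext_magic` and `range_filled_with` are made up for this illustration, and the input is assumed to be a byte string read from the start of the partition.

```python
import struct

SUPERBLOCK_OFFSET = 0x400  # primary ext2/ext3 superblock offset in the partition
MAGIC_OFFSET = 56          # offset of the s_magic field within the superblock
EXT_MAGIC = 0xEF53         # ext2/ext3 magic number


def has_ext_magic(data):
    """Return True if the ext2/ext3 magic is present where the superblock should be."""
    pos = SUPERBLOCK_OFFSET + MAGIC_OFFSET
    if len(data) < pos + 2:
        return False
    (magic,) = struct.unpack_from("<H", data, pos)
    return magic == EXT_MAGIC


def range_filled_with(data, start, end, value=0xFF):
    """Return True if every byte in data[start:end] equals value."""
    chunk = data[start:end]
    return len(chunk) == end - start and all(b == value for b in chunk)
```

A possible use: read the first 0x2080 bytes of the partition into `data`, then call `range_filled_with(data, 0x200, 0x400)` to test the area in front of the superblock and `has_ext_magic(data)` to see whether the superblock itself still carries its magic.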

What else could cause the problem then? We don't see any sign of a
problem before the reboot, only after. Could the IDE driver be the
problem? Or is it the IDE CF card hardware?

I've done a dd if=/dev/hdc of=/dev/null and there was absolutely no
trouble visible (nothing in kern.log/dmesg), thus the card does not seem
to be broken on the physical level and doesn't have bad blocks that
would fail on read.
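
The same kind of sequential read test can be sketched in Python, which makes it easy to record the offset at which a read fails. This is only an illustration under stated assumptions: `read_through` is a hypothetical helper, the chunk size is arbitrary, and like the dd run it exercises reads only, so it says nothing about writes.

```python
def read_through(path, chunk_size=1 << 20):
    """Read a file or block device sequentially, discarding the data.

    Returns (bytes_read, error), where error is None on success or the
    OSError raised by the failing read.
    """
    total = 0
    # buffering=0 gives unbuffered reads straight from the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            try:
                chunk = f.read(chunk_size)
            except OSError as exc:
                return total, exc
            if not chunk:  # empty result means end of file/device
                return total, None
            total += len(chunk)
```

Calling `read_through("/dev/hdc")` (run as a user allowed to read the device) would return the number of bytes read and, if a read failed, the exception, with `bytes_read` giving the approximate offset of the failure.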

Does this ring a bell with anybody?
*t

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users