Hello everybody,

we're running a small population of lightly embedded machines with the
following specs:

System: more or less a standard Intel box
FS:     ext3 (defaults,errors=remount-ro,noatime)
HD:     TRANSCEND, ATA DISK drive, Compact Flash (CF),
        2000880 sectors (1024 MB) w/2KiB Cache, CHS=1985/16/63
Driver: Standard IDE driver
        ICH4: chipset revision 2
        ICH4: not 100% native mode: will probe irqs later
        ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:pio, hdb:pio
        ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
Kernel: 2.6.15.6 #1 PREEMPT Sat Mar 11 00:56:41 CET 2006 i686 GNU/Linux

ext3 was chosen in the hope of making the system more resilient to power
failures. The systems run on a UPS, but unfortunately some operators will
just pull the power plug (although they're instructed not to).

What we have now experienced multiple times is that the systems run just
fine, with absolutely no complaints in dmesg/kern.log, until they are
rebooted (shutdown -r now). At that point, *very rarely*, GRUB is no longer
able to read the boot filesystem (Error 17).

I've checked the on-disk data and discovered that 0x200-0x1c00 is
overwritten with 0xff, followed by a single 0x0f and then 0x00 until 0x207f.
That is, the second to the sixteenth on-disk blocks have been overwritten:

000001e0  53 59 53 4d 53 44 4f 53 20 20 20 53 59 53 7f 01  |SYSMSDOS   SYS..|
000001f0  00 41 bb 00 07 60 66 6a 00 e9 3b ff 00 00 00 00  |.A»..`fj.é;ÿ....|
00000200  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  |ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ|
*
00001c00  ff 0f 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |ÿ...............|
00001c10  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
*
00002080  ed 41 00 00 00 04 00 00 1e 39 a0 46 a6 6a dd 45  |íA.......9 F¦jÝE|

Our project does no hardware-level operations. All access is through regular
file operations only. Thus there's no way we're aware of in which our
software could be changing on-disk blocks directly.
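For anyone who wants to check a raw dump for the same kind of damage, the offsets above can be mapped to sector numbers and the dump scanned for long runs of a single byte value. The following is only a minimal sketch in Python, not something we run on the boxes: the names `sectors_touched` and `find_runs` are made up for illustration, and the 512-byte sector size is an assumption derived from 2000880 sectors being roughly 1024 MB.

```python
# Assumption: 512-byte sectors (2000880 sectors * 512 bytes is about 1024 MB).
SECTOR_SIZE = 512


def sectors_touched(start, end, sector_size=SECTOR_SIZE):
    """Return (first, last) zero-based sector indices covering the
    byte range start..end (both inclusive)."""
    return start // sector_size, end // sector_size


def find_runs(data, min_len=SECTOR_SIZE):
    """Return a list of (offset, length, byte_value) for every run of
    identical bytes in `data` that is at least `min_len` bytes long.

    Intended for small dumps (the first few KiB of a device); it walks
    the buffer byte by byte and is not tuned for whole-disk images.
    """
    runs = []
    i = 0
    n = len(data)
    while i < n:
        j = i + 1
        # Extend the run while the next byte equals the first byte of the run.
        while j < n and data[j] == data[i]:
            j += 1
        if j - i >= min_len:
            runs.append((i, j - i, data[i]))
        i = j
    return runs
```

For example, `sectors_touched(0x200, 0x1fff)` gives `(1, 15)`, i.e. the second to the sixteenth sector when counting from one. `find_runs` would be called on the bytes read from the start of a dump file.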
What's striking about the problem above is that the first affected block
(0x200) starts _before_ the on-disk filesystem, which starts at 0x400.

My question is: does the ext3 driver _ever_ write outside of its own space
on disk, i.e. into 0x000-0x400? That is, can we rule out with certainty that
the ext3 driver is causing the problem?

What else could cause the problem then? We don't see any sign of a problem
before the reboot, only after. Could the IDE driver be the problem? Or is it
the IDE CF card hardware?

I've done a

    dd if=/dev/hdc of=/dev/null

and there was absolutely no trouble visible (nothing in kern.log/dmesg), so
the card does not seem to be broken on the physical level and doesn't have
bad blocks that would fail on read.

Does this ring a bell with anybody?

*t
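P.S. One way to narrow down whether the overwrite happens while the system is running or only around the reboot might be to record a checksum of the first 0x2080 bytes of the device and compare it again later, e.g. just before shutdown. This is a minimal sketch in Python under a few assumptions: `region_digest` and `region_changed` are hypothetical names, the default length simply covers the damaged range shown in the dump, and a read of the block device goes through the kernel's cache, so it may not reflect what is actually on the flash.

```python
import hashlib


def region_digest(path, offset=0, length=0x2080):
    """Return the SHA-256 hex digest of `length` bytes starting at
    `offset` in the file or block device at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        # A short read would silently change the digest, so refuse it.
        raise ValueError("short read: got %d of %d bytes" % (len(data), length))
    return hashlib.sha256(data).hexdigest()


def region_changed(path, reference_digest, offset=0, length=0x2080):
    """Return True if the region no longer matches `reference_digest`."""
    return region_digest(path, offset, length) != reference_digest


# Hypothetical usage (needs read access to the device):
#   ref = region_digest("/dev/hdc")      # taken while the system is known good
#   ...
#   if region_changed("/dev/hdc", ref):  # checked again before shutdown
#       print("first sectors differ from the reference")
```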