On Jan 28, 2002  15:46 -0800, Ben Rockwood wrote:
> >That is a very unusual situation, because e2fsck will rarely crash, and it
> >should be impossible for it to hang the system.  Do you get any errors on
> >the screen related to IDE/irq/hda failures?
>
> No.  I got an "Oops" message, but it wasn't very helpful.
> The drive has worked flawlessly.  It's a u160 IBM UltraStar.

Well, unfortunately the oops message (screenshot in a separate email) is
mostly unusable because it has not been decoded via ksymoops on your
machine.  There was one oops during e2fsck, but there shouldn't be anything
that e2fsck could do to cause a kernel oops, as it is a user-space process.
You could try hand-typing these oops messages into a text file, and then
run ksymoops on them to see if it decodes to anything useful.

Is this one of the IBM drives made in Hungary (or somewhere like that)
which has astronomical defect rates?  Maybe that is also worth checking.

> >Sounds like it is hardware (disk/ram/cpu).  Did you make any other changes
> >to your system since the last time it was rebooted (e.g. new kernel,
> >hdparm settings, BIOS changes, etc)?
>
> No changes whatsoever.  The system had crashed for what seemed like
> no reason to me, I hard reset it and during the boot up FSCK it hung.

Can't comment.  The e2fsck run got "signal 11" which for gcc normally means
that you have a memory error.  I don't know if the same applies to e2fsck
(if it is a libc/malloc error message it might).

> >Which one are you using now?  Probably 2.4.17/18-preX is the safest to use,
> >I never used 2.4.15, so I don't know much about it, and 2.5.1 is certainly
> >not going to be the most robust kernel.
>
> I'm primarily relying on 2.4.17-mjc, because it's the most robust kernel
> I have, and I trust it the most.  I was using 2.5.1 for about a week after
> it came out and had no problems with it, but moved back to 2.4 to play
> with the new O(1) scheduler and some of the other performance boosts.
> The exact same problems occur under 2.5.1 and 2.4.15.

OK, well a 2.4.17 kernel should be pretty robust, but I'm not totally sure
what's in -mjc kernels.  I've been running 2.4.17-pre8 for a long time
without any problems on my development laptop (mostly light usage,
occasional parallel makes in progress).

> Frankly, I really wish I had parts to swap out on this box, but I
> don't have the parts...

Well, some options still exist to try and narrow the problem:
1) check that all the fans/heatsinks are OK
2) underclock the CPU
3) run memtest86 to check the RAM
4) remove one or more DIMMs and/or swap locations of DIMMs
5) try a vanilla kernel built on another machine

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/