Hi,

On Fri, Jan 25, 2002 at 01:05:19PM +0900, P. Fleury wrote:

> >Can you trap kernel log output, in case there's an oops being
> >reported?  If you have a text-mode console, you may have to copy it
> >down by hand.  If not, it is possible to set up a serial console and
> >record the kernel output on another machine.

> Well, this time I got something. The sequence was:
> - start machine, use it for 1/2 day, access it via NFS, HTTP, IMAP (3
>   concurrent sites) and Samba.
> - After a while, machine load goes up, login impossible even on console.
>   After an hour, I could login as root.

That's consistent with an oops, yes.

> - tried to reboot, to no avail. After 1 hour of waiting, tried 'telinit
>   6'. Then, remotely, nothing more possible.
> - The machine did not reboot, says /dev/md cannot be unmounted, it is busy.
> - hard reset.
> - RAID-5 resync running for a while, then:
>
> Jan 25 10:35:21 lafleur syslogd 1.4.1: restart.
> Jan 25 11:26:38 lafleur kernel: Unable to handle kernel paging request
> at virtual address 493dd238

Well, we know ext3 on soft raid5 is usable in that kernel.  On a
development box here ---

        $ uptime
          9:52am  up 81 days,  1:53, 32 users,  load average: 0.39, 0.44, 0.73
        $ uname -a
        Linux porkchop.redhat.com 2.4.9-13smp #1 SMP Tue Oct 30 19:57:16 EST 2001 i686 unknown
        $ df -h /dev/md1
        Filesystem            Size  Used Avail Use% Mounted on
        /dev/md1              200G  181G  9.4G  96% /mnt/md1
        $ grep md1 /proc/mdstat
        md1 : active raid5 sdh1[3] sdg1[2] sdf1[1] sde1[0] sdd1[7] sdc1[6] sdb1[5] sda1[4]

That's a 2.4.9-13 kernel with a few soft raid partitions, the largest
of which is 200GB of ext3 on raid5 striped over 7 disks (plus one hot
spare).  There are currently 20 different users logged on, it is used
regularly for large builds, and has been running perfectly for 81 days
so far.

I'd suspect that you have a hardware problem, but only by capturing
more info will we be able to be sure.
The last oops you showed me was inside the raid5 code, but that code
touches so much memory that it can be especially sensitive to bad
memory.

Again, can you set up a serial console to capture any oopses reliably?
I don't think we can easily narrow down the problem here without
that.

Cheers,
 Stephen
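
P.S. In case it helps, here is a minimal sketch of a serial console
setup. It is only an illustration and assumes several things not
stated above: the first serial port (ttyS0) on both machines, a
null-modem cable between them, 9600 baud, and LILO as the boot loader.
The file name oops.log is just an example. Adjust all of these to
your hardware.

```
# On the failing machine: add this to the kernel's image section in
# /etc/lilo.conf, then re-run /sbin/lilo and reboot.  Kernel messages
# go to both the VGA console and the serial port.
append="console=tty0 console=ttyS0,9600n8"

# On the capturing machine: set the port to the same speed, then log
# everything that arrives on it while also showing it on screen.
stty -F /dev/ttyS0 9600 raw -echo
cat /dev/ttyS0 | tee oops.log
```

Any oops text should then end up in oops.log on the second machine
even if the failing box can no longer write to its own disk or
console.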