ext3, S/W RAID-5 and many services

sct@redhat.com (Stephen C. Tweedie) · Mon, 28 Jan 2002 15:00:16 +0000

Hi,

On Fri, Jan 25, 2002 at 01:05:19PM +0900, P. Fleury wrote:

>  >Can you trap kernel log output, in case there's an oops being
>  >reported?  If you have a text-mode console, you may have to copy it
>  >down by hand.  If not, it is possible to set up a serial console and
>  >record the kernel output on another machine.

> Well, this time I got something. The sequence was:
> - start machine, use it for 1/2 day, access it via NFS, HTTP, IMAP (3
> concurrent sites) and Samba.
> - After a while, machine load goes up, login impossible even on console.
> After an hour, I could login as root.

That's consistent with an oops, yes.

> - tried to reboot, to no avail. After 1 hour of waiting, tried 'telinit
> 6'. Then, remotely, nothing more possible.
> - The machine did not reboot, says /dev/md cannot be unmounted, it is busy.
> - hard reset.
> - RAID-5 resync running for a while, then:
> 
> Jan 25 10:35:21 lafleur syslogd 1.4.1: restart.
> Jan 25 11:26:38 lafleur kernel: Unable to handle kernel paging request at virtual address 493dd238

Well, we know ext3 on soft raid5 is usable in that kernel.  On a
development box here ---

$ uptime
  9:52am  up 81 days,  1:53, 32 users,  load average: 0.39, 0.44, 0.73
$ uname -a
Linux porkchop.redhat.com 2.4.9-13smp #1 SMP Tue Oct 30 19:57:16 EST 2001 i686 unknown
$ df -h /dev/md1
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              200G  181G  9.4G  96% /mnt/md1
$ grep md1 /proc/mdstat 
md1 : active raid5 sdh1[3] sdg1[2] sdf1[1] sde1[0] sdd1[7] sdc1[6] sdb1[5] sda1[4]

That's a 2.4.9-13 kernel with a few soft raid partitions, the largest
of which is 200GB of ext3 on raid5 striped over 7 disks (plus one hot
spare).  There are currently 20 different users logged on; it is used
regularly for large builds, and has been running perfectly for 81 days
so far.

I'd suspect that you have a hardware problem, but only by capturing
more info will we be able to be sure.  The last oops you showed me was
inside the raid5 code, but that code touches so much of memory that it
can be especially sensitive to bad memory.

Again, can you set up a serial console to capture any oopses reliably?
I don't think we can easily narrow down the problem here without that.
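
For reference, a minimal sketch of such a setup.  It assumes the
first serial port (ttyS0) at 9600 baud, LILO as the boot loader, a
null-modem cable to a second Linux machine, and a log file name
(oops.log) picked only for illustration; adjust all of these for your
own hardware.

```shell
# On the machine that oopses: add the kernel console arguments to the
# image section of /etc/lilo.conf, then rerun /sbin/lilo and reboot.
# Kernel messages go to every console listed, so the local screen
# (tty0) keeps working alongside the serial port:
#
#   append="console=ttyS0,9600n8 console=tty0"

# On the capturing machine: match the line settings, then record
# everything that arrives on the serial port.
stty -F /dev/ttyS0 9600 cs8 -parenb raw
cat /dev/ttyS0 | tee oops.log
```

Leave the capture running until the next lockup; the complete oops
text should then be in the log file.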

Cheers,
 Stephen