Stephen C. Tweedie wrote:

>Hi,
>
>On Fri, Jan 18, 2002 at 04:15:28PM +0900, P. Fleury wrote:
>
>
>>I use ext3 over Software RAID-5, and access this through Samba/NFS/HTTP.
>>From time to time, the machine hangs, no response to any kind of
>>input (ping does not respond, nor keyboard/mouse). Only hard-reset does
>>the trick.
>>
>>I also noticed that 2 of the 7 disks are in UDMA 33, the others in UDMA
>>100. Does this have any impact? (besides performance)
>>
>>If I do not mount the ext3 partition, it runs fine. Any help?
>>
>
>Can you trap kernel log output, in case there's an oops being
>reported? If you have a text-mode console, you may have to copy it
>down by hand. If not, it is possible to set up a serial console and
>record the kernel output on another machine.
>
>Cheers,
> Stephen
>

Well, this time I got something. The sequence was:

- Started the machine and used it for half a day, accessing it via NFS, HTTP, IMAP (3 concurrent sites) and Samba.
- After a while, the machine load went up and login became impossible, even on the console. After an hour, I could log in as root.
- Tried to reboot, to no avail. After 1 hour of waiting, tried 'telinit 6'. After that, nothing more was possible remotely.
- The machine did not reboot; it says /dev/md cannot be unmounted because it is busy.
- Hard reset.
- The RAID-5 resync ran for a while, then:

Jan 25 10:35:21 lafleur syslogd 1.4.1: restart.
Jan 25 11:26:38 lafleur kernel: Unable to handle kernel paging request at virtual address 493dd238
Jan 25 11:26:38 lafleur kernel: printing eip:
Jan 25 11:26:38 lafleur kernel: f083eff2
Jan 25 11:26:38 lafleur kernel: *pde = 00000000
Jan 25 11:26:38 lafleur kernel: Oops: 0002
Jan 25 11:26:38 lafleur kernel: CPU: 0
Jan 25 11:26:38 lafleur kernel: EIP: 0010:[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1388558/96] Not tainted
Jan 25 11:26:38 lafleur kernel: EIP: 0010:[<f083eff2>] Not tainted
Jan 25 11:26:38 lafleur kernel: EFLAGS: 00010216
Jan 25 11:26:38 lafleur kernel: eax: 00000000 ebx: 00001000 ecx: 00000400 edx: 00000000
Jan 25 11:26:38 lafleur kernel: esi: 00000018 edi: 493dd238 ebp: 00000007 esp: efb19e58
Jan 25 11:26:38 lafleur kernel: ds: 0018 es: 0018 ss: 0018
Jan 25 11:26:38 lafleur kernel: Process raid5d (pid: 19, stackpage=efb19000)
Jan 25 11:26:38 lafleur kernel: Stack: ef861804 00001000 c017c6ad c033ce80 00000282 00000282 00000003 c1f6f908
Jan 25 11:26:38 lafleur kernel: c21d6400 00000000 00000007 00000000 00000001 00000004 f083ffd8 ef861800
Jan 25 11:26:38 lafleur kernel: 00000002 c01871dd 00000246 c033ce40 0000000c 0000007c fffffffc fffffff4
Jan 25 11:26:38 lafleur kernel: Call Trace: [generic_make_request+241/256] generic_make_request [kernel] 0xf1
Jan 25 11:26:38 lafleur kernel: Call Trace: [<c017c6ad>] generic_make_request [kernel] 0xf1
Jan 25 11:26:38 lafleur kernel: [3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1384488/96] __insmod_raid5_S.text_L13736 [raid5] 0x1f78
Jan 25 11:26:38 lafleur kernel: [<f083ffd8>] __insmod_raid5_S.text_L13736 [raid5] 0x1f78
Jan 25 11:26:38 lafleur kernel: [ide_set_handler+85/92] ide_set_handler [kernel] 0x55
Jan 25 11:26:38 lafleur kernel: [<c01871dd>] ide_set_handler [kernel] 0x55
Jan 25 11:26:38 lafleur kernel: [ide_dma_intr+0/156] ide_dma_intr [kernel] 0x0
Jan 25 11:26:38 lafleur kernel: [<c0190a3c>] ide_dma_intr [kernel] 0x0
Jan 25 11:26:38 lafleur kernel: [dma_timer_expiry+0/100] dma_timer_expiry [kernel] 0x0
Jan 25 11:26:38 lafleur kernel: [<c019114c>] dma_timer_expiry [kernel] 0x0
Jan 25 11:26:38 lafleur kernel: [do_IRQ+144/156] do_IRQ [kernel] 0x90
Jan 25 11:26:38 lafleur kernel: [<c0108110>] do_IRQ [kernel] 0x90
Jan 25 11:26:38 lafleur kernel: [3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1382890/96] device_bsize [raid5] 0x222
Jan 25 11:26:38 lafleur kernel: [<f0840616>] device_bsize [raid5] 0x222
Jan 25 11:26:38 lafleur kernel: [md_thread+212/308] md_thread [kernel] 0xd4
Jan 25 11:26:38 lafleur kernel: [<c01b1454>] md_thread [kernel] 0xd4
Jan 25 11:26:38 lafleur kernel: [kernel_thread+38/48] kernel_thread [kernel] 0x26
Jan 25 11:26:38 lafleur kernel: [<c010566e>] kernel_thread [kernel] 0x26
Jan 25 11:26:38 lafleur kernel: [md_thread+0/308] md_thread [kernel] 0x0
Jan 25 11:26:38 lafleur kernel: [<c01b1380>] md_thread [kernel] 0x0
Jan 25 11:26:38 lafleur kernel:
Jan 25 11:26:38 lafleur kernel:
Jan 25 11:26:38 lafleur kernel: Code: f3 ab f6 c3 02 74 02 66 ab f6 c3 01 74 01 aa 8b 14 24 8d 5d

After this, trying to reboot says umount2 has problems and the MD thread
is being interrupted after the message 'Wait while the system is
restarting', but nothing happens.

Is there a way to spend less than 30 minutes per day baby-sitting my server?

--Pascal