2.4.17, SW raid5 & swap & root -> NOT SAFE

I have three disks, each partitioned into three partitions. For /boot, I
use RAID1 over the first partitions (hde1, hdg1 and hdi1).

For root, I use RAID5 over the second set of partitions (hde2, hdg2 and
hdi2), and for swap, RAID5 over the third set (hde3, hdg3 and hdi3).
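
For reference, the layout corresponds roughly to the following /etc/raidtab
(paraphrased from memory - the chunk sizes and parity algorithm below are
guesses, not copied verbatim from my config):

  raiddev /dev/md0
      raid-level              1
      nr-raid-disks           3
      nr-spare-disks          0
      persistent-superblock   1
      chunk-size              4
      device                  /dev/hde1
      raid-disk               0
      device                  /dev/hdg1
      raid-disk               1
      device                  /dev/hdi1
      raid-disk               2

  raiddev /dev/md1
      raid-level              5
      nr-raid-disks           3
      nr-spare-disks          0
      persistent-superblock   1
      parity-algorithm        left-symmetric
      chunk-size              32
      device                  /dev/hde2
      raid-disk               0
      device                  /dev/hdg2
      raid-disk               1
      device                  /dev/hdi2
      raid-disk               2

md2 (swap) is defined the same way as md1, over hde3, hdg3 and hdi3.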

The problem is that each time I try to simulate a disk failure (using
raidsetfaulty) on either of the RAID5 arrays, I get a nasty oops (this is a
copy-paste version; the affected md device ends up blocked):

raid5: Disk failure on hdi2, disabling device. Operation continuing on 2 devices
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01d1cd3
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c01d1cd3>]    Not tainted
EFLAGS: 00010246
eax: dfec34a0   ebx: 00003802   ecx: 00000400   edx: 00001000
esi: 00000000   edi: dfc5e000   ebp: dfee0da0   esp: dfeb9f50
ds: 0018   es: 0018   ss: 0018
Process raid5d (pid: 10, stackpage=dfeb9000)
Stack: dfee0da0 dfee0e60 dfec5dd4 dfec5dc0 00000000 dfec34a0 c01d1f78 dfee0da0
       c02572ff 00000004 dfeb8000 dfed7c00 dfee0ca0 00000001 00000064 00000000
       c01cdac9 dfec5dc0 dfeb8000 dfeb8000 dfee0ca0 00000001 c0288000 00000246
Call Trace: [<c01d1f78>] [<c01cdac9>] [<c01d4c1c>] [<c0105794>]

Code: f3 a5 8b 44 24 14 8b 54 24 10 f0 0f ab 50 18 8b 44 24 14 e8
 [root@temp /root]# <6>md: recovery thread got woken up ...
md1: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
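
For completeness, the failure above was injected with raidtools, along these
lines (md1 is the RAID5 root array; the swap array behaves the same):

  raidsetfaulty /dev/md1 /dev/hdi2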


It seems something is wrong in the raid5 code... Can anyone confirm or deny this?
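
If it helps, I can run the oops through ksymoops to resolve the symbols,
something like this (path to the running kernel's System.map assumed):

  ksymoops -m /boot/System.map < oops.txt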

   Thanks, D.

PS: I tried using ext2 and ext3 on the root partition. It didn't matter.

PPS: If I disconnect the power cable of one of the disks to simulate a
real hardware failure, all disk I/O blocks (i.e. nothing that isn't
already in memory can be loaded or executed) and the system keeps telling
me that the disk has lost an interrupt - for a looong time. The RAID system
never detects the failure and kicks the disk out of the array(s). Shouldn't it?
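
For what it's worth, I was watching the array state in /proc/mdstat the
whole time:

  cat /proc/mdstat

If md had kicked the disk, I'd expect its entry to be marked (F) and the
status to drop from [3/3] [UUU] to [3/2] [UU_] - but nothing changed.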

