Re: 2.4.17, SW raid5 & swap & root -> NOT SAFE

Neil Brown <neilb@cse.unsw.edu.au> · Fri, 1 Feb 2002 11:54:33 +1100 (EST)

On Thursday January 31, danci@agenda.si wrote:
> I have three disks, each partitioned into three partitions. For /boot, I
> use RAID1 over first partitions (hde1, hdg1 and hdi1).
> 
> For root, I use RAID5 over second set of partitions (hde2, hdg2 and hdi2)
> and for swap, I use RAID5 over the third set of partitions (hde3, hdg3 and
> hdi3).
> 
> The problem is that each time I try to simulate a disk failure (using
> raidsetfaulty) on any of the RAID5 arrays, I get a nasty error (this is a
> 'copy-paste' version, the actual md device is blocked):
> 
> raid5: Disk failure on hdi2, disabling device. Operation continuing on 2
> devices
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000
>  printing eip:
> c01d1cd3

Any chance of running this Oops through ksymoops to see where the
problem really is?
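
For anyone unsure what that step buys us: ksymoops turns the raw
addresses in the Oops (the EIP and the Call Trace entries) into kernel
symbol names, using the System.map that matches the running kernel. A
minimal Python sketch of that kind of lookup follows. It is only an
illustration, not a replacement for ksymoops, and the symbol names in
the sample map are hypothetical placeholders; the real ones would come
from the System.map of the kernel that produced the Oops.

```python
import bisect
import re


def load_system_map(text):
    """Parse System.map-style lines ("address type name") into a sorted list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the nearest symbol at or below addr, else None."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base


def oops_addresses(oops_text):
    """Collect bracketed code addresses such as [<c01d1cd3>] from Oops text."""
    return [int(m, 16) for m in re.findall(r"\[<([0-9a-fA-F]{8})>\]", oops_text)]


if __name__ == "__main__":
    # HYPOTHETICAL map entries, for illustration only.
    sample_map = (
        "c01cda00 T hypothetical_func_a\n"
        "c01d1c00 t hypothetical_func_b\n"
        "c01d1f00 t hypothetical_func_c\n"
    )
    # These two lines are taken from the Oops quoted in this thread.
    oops = (
        "> EIP:    0010:[<c01d1cd3>]    Not tainted\n"
        "> Call Trace: [<c01d1f78>] [<c01cdac9>] [<c01d4c1c>] [<c0105794>]\n"
    )
    symbols = load_system_map(sample_map)
    for addr in oops_addresses(oops):
        hit = resolve(symbols, addr)
        if hit is None:
            print("%08x  (below first symbol in map)" % addr)
        else:
            print("%08x  %s+0x%x" % (addr, hit[0], hit[1]))
```

Note that the nearest-symbol lookup is only as good as the map it is
given: with a System.map from a different kernel build the names will
be wrong, which is why the map has to match the kernel that oopsed.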

NeilBrown

> *pde = 00000000
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c01d1cd3>]    Not tainted
> EFLAGS: 00010246
> eax: dfec34a0   ebx: 00003802   ecx: 00000400   edx: 00001000
> esi: 00000000   edi: dfc5e000   ebp: dfee0da0   esp: dfeb9f50
> ds: 0018   es: 0018   ss: 0018
> Process raid5d (pid: 10, stackpage=dfeb9000)
> Stack: dfee0da0 dfee0e60 dfec5dd4 dfec5dc0 00000000 dfec34a0 c01d1f78 dfee0da0
>        c02572ff 00000004 dfeb8000 dfed7c00 dfee0ca0 00000001 00000064 00000000
>        c01cdac9 dfec5dc0 dfeb8000 dfeb8000 dfee0ca0 00000001 c0288000 00000246
> Call Trace: [<c01d1f78>] [<c01cdac9>] [<c01d4c1c>] [<c0105794>]
> 
> Code: f3 a5 8b 44 24 14 8b 54 24 10 f0 0f ab 50 18 8b 44 24 14 e8
>  [root@temp /root]# <6>md: recovery thread got woken up ...
> md1: no spare disk to reconstruct array! -- continuing in degraded mode
> md: recovery thread finished ...
> 
> 
> It seems something is wrong in raid5 code... Can anyone confirm/deny this?
> 
>    Thanks, D.
> 
> PS: I tried using ext2 and ext3 on the root partition. It didn't matter.
> 
> PPS: If I disconnect one of the disks (the power cable) to simulate a
> real hardware failure, disk IO is blocked (i.e. nothing that isn't
> already in memory can be loaded or executed) and the system keeps telling
> me that the disk has lost interrupt - for a looong time. The RAID system
> didn't detect the failure and kick the disk out of the array(s). Shouldn't it?
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html