Re: 2.4.17, SW raid5 & swap & root -> NOT SAFE

hi ya danilo

i suspect you don't have a spare disk, since you have 3 drives
and hde, hdg, and hdi are all in use...
	- so the error message is correct ( with no spare disk, the
	  array will keep running in degraded mode )

- some motherboards/pci cards are dumb...
	- you CANNOT boot off of hde,hdf,hdg,hdh....
	( bad choice of mb for booting raid5, i suppose )

- if your disk partition type is "raid autodetect" ( fd ), then your
  setup is fine... just move everything to hda, hdc and hde and hope
  that hde is the drive that fails...
	- boot from a standalone floppy/linux-bbc to move its root
	  around
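
( a quick sketch of marking the partitions as type fd with sfdisk --
the device names here are just the ones from this thread, adjust for
your box )

```shell
# mark partition 2 on each drive as "linux raid autodetect" ( fd )
# so the kernel assembles the arrays by itself at boot
sfdisk --change-id /dev/hde 2 fd
sfdisk --change-id /dev/hdg 2 fd
sfdisk --change-id /dev/hdi 2 fd

# verify -- each of these should print "fd"
sfdisk --print-id /dev/hde 2
sfdisk --print-id /dev/hdg 2
sfdisk --print-id /dev/hdi 2
```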

- if you cannot boot off hda+hdc or hda+hde or hdc+hde, then i'd use a
  simple mirrored ( raid1 ) root so that you can boot off of either hda
  or hdc, but your data is still raid5 across hda, hdc, hde
	- and i'd add a new hdf disk too
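
( a rough /etc/raidtab sketch of that layout for the old raidtools --
the partition numbers are an assumption, check against your own disks )

```
# raid1 mirrored root -- bootable from either hda or hdc
raiddev /dev/md0
	raid-level		1
	nr-raid-disks		2
	persistent-superblock	1
	chunk-size		4
	device			/dev/hda1
	raid-disk		0
	device			/dev/hdc1
	raid-disk		1

# raid5 data across all three drives
raiddev /dev/md1
	raid-level		5
	nr-raid-disks		3
	persistent-superblock	1
	parity-algorithm	left-symmetric
	chunk-size		32
	device			/dev/hda2
	raid-disk		0
	device			/dev/hdc2
	raid-disk		1
	device			/dev/hde2
	raid-disk		2
```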

and yes... pulling the ide cable off of the disk  is a good test for
booting and testing raid5... cool ...
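
( for a software-only failure test before you go pulling cables, the
raidtools way looks roughly like this -- md/partition names assumed
from your setup )

```shell
# mark hde2 as failed in md1, then remove it from the array
raidsetfaulty /dev/md1 /dev/hde2
raidhotremove /dev/md1 /dev/hde2

# md1 should now show up degraded, e.g. [_UU]
cat /proc/mdstat

# put the disk back -- the kernel starts resyncing it
raidhotadd /dev/md1 /dev/hde2
```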

c ya
alvin
http://www.1U-Raid5.net ...

On Thu, 31 Jan 2002, Danilo Godec wrote:

> I have three disks, each partitioned into three partitions. For /boot, I
> use RAID1 over first partitions (hde1, hdg1 and hdi1).
> 
> For root, I use RAID5 over second set of partitions (hde2, hdg2 and hdi2)
> and for swap, I use RAID5 over the third set of partitions (hde3, hdg3 and
> hdi3).
> 
> The problem is, that each time I try to simulate a disk failure (using
> raidsetfaulty) on any of RAID5 arrays, I get a nasty error (this is a
> 'copy-paste' version, the actual md device is blocked):
> 
> raid5: Disk failure on hdi2, disabling device. Operation continuing on 2
> devicesUnable to handle kernel NULL pointer dereference at virtual address
> 00000000
>  printing eip:
> c01d1cd3
> *pde = 00000000
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c01d1cd3>]    Not tainted
> EFLAGS: 00010246
> eax: dfec34a0   ebx: 00003802   ecx: 00000400   edx: 00001000
> esi: 00000000   edi: dfc5e000   ebp: dfee0da0   esp: dfeb9f50
> ds: 0018   es: 0018   ss: 0018
> Process raid5d (pid: 10, stackpage=dfeb9000)
> Stack: dfee0da0 dfee0e60 dfec5dd4 dfec5dc0 00000000 dfec34a0 c01d1f78 dfee0da0
>        c02572ff 00000004 dfeb8000 dfed7c00 dfee0ca0 00000001 00000064 00000000
>        c01cdac9 dfec5dc0 dfeb8000 dfeb8000 dfee0ca0 00000001 c0288000 00000246
> Call Trace: [<c01d1f78>] [<c01cdac9>] [<c01d4c1c>] [<c0105794>]
> 
> Code: f3 a5 8b 44 24 14 8b 54 24 10 f0 0f ab 50 18 8b 44 24 14 e8
>  [root@temp /root]# <6>md: recovery thread got woken up ...
> md1: no spare disk to reconstruct array! -- continuing in degraded mode
> md: recovery thread finished ...
> 
> 
> It seems something is wrong in raid5 code... Can anyone confirm/deny this?
> 
>    Thanks, D.
> 
> PS: I tried using ext2 and ext3 on the root partition. It didn't matter.
> 
> PPS: If I disconnect one of the disks (the power cable) to simulate a
> real hardware failure, disk IO is blocked (ie. nothing that isn't
> already in memory can be loaded or executed) and the system keeps telling
> me that the disk has lost an interrupt - for a looong time. The RAID system
> didn't detect the failure and kick the disk out of the array(s). Shouldn't it?
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

