Good day!

Some time ago I found that the HDD in my U10 is about to die, so I decided to migrate my rootfs to a second disk running in degraded RAID1 mode (I planned to replace the broken HDD later to complete the RAID1 mirror). I used the method described here:

http://gentoo-wiki.com/HOWTO_Migrate_To_RAID#Migrating_from_no_RAID_to_RAID-1

I created a degraded RAID on the second disk and copied the rootfs onto it, then replaced 'root=/dev/hda1' with 'root=/dev/md1' in /etc/silo.conf and rebooted the computer. After a successful reboot I tried to reinstall the SILO loader onto the degraded RAID (i.e. the second HDD) to get rid of the first (broken) HDD, which was still being used for booting. But when I ran `silo` I got this message:

sunflower ~ # silo
/etc/silo.conf appears to be valid
Fatal error: No non-faulty disks found in RAID1

Here's the contents of my silo.conf:

sunflower ~ # cat /etc/silo.conf | grep -v "^\#" | grep -v '^$'
partition = 1
root = /dev/md1
timeout = 100
image = /boot/kernel-2.6.16.20
label = linux-2.6.16.20
image = /boot/2.6.16.19
label = linux

And here's my RAID config:

sunflower ~ # cat /etc/mdadm.conf | grep -v "^\#" | grep -v '^$'
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=f1fbc4ce:a04bf77c:3198d8f6:043776c8
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=b225d38c:dd66032e:e6b0c497:97ce7adf
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=6768152f:04222a1a:d00871c9:68547878
ARRAY /dev/md4 level=raid1 num-devices=2 UUID=1252ad0d:44bf5027:48510843:842236b4
ARRAY /dev/md5 level=raid1 num-devices=2 UUID=22bdc48f:4902e0b5:66887282:984d65db

sunflower ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hdc1[1]
      987904 blocks [2/1] [_U]
md2 : active raid1 hdc2[1]
      995904 blocks [2/1] [_U]
md3 : active raid1 hdc4[1]
      4000064 blocks [2/1] [_U]
md4 : active raid1 hdc5[1]
      2000000 blocks [2/1] [_U]
md5 : active raid1 hdc6[1]
      32194176 blocks [2/1] [_U]
unused devices: <none>

I looked into silo.c and found this code snippet:
----------------------------------------------------------------------------
1004:     case 9: /* RAID device */
1005:     {
1006:         md_array_info_t md_array_info;
1007:         md_disk_info_t md_disk_info;
1008:         int md_fd, i, id = 0;
1009:         struct hwdevice *d, *last;
1010:
1011:         sprintf (dev, "/dev/md%d", minno);
1012:         md_fd = devopen (dev, O_RDONLY);
1013:         if (md_fd < 0)
1014:             silo_fatal("Could not open RAID device");
1015:         if (ioctl (md_fd, GET_ARRAY_INFO, &md_array_info) < 0)
1016:             silo_fatal("Could not get RAID array info");
1017:         if (md_array_info.major_version == 0 && md_array_info.minor_version < 90)
1018:             silo_fatal("Raid versions < 0.90 are not "
1019:                        "supported");
1020:         if (md_array_info.level != 1)
1021:             silo_fatal("Only RAID1 supported");
1022:         hwdev = NULL;
1023:         last = NULL;
1024:         for (i = 0; i < md_array_info.nr_disks; i++) {
1025:             if (i == md_array_info.nr_disks - 1 && md_disk_info.majorno == 0 &&
1026:                 md_disk_info.minorno == 0)
1027:                 break; // That's all folks
1028:             md_disk_info.number = i;
1029:             if (ioctl (md_fd, GET_DISK_INFO, &md_disk_info) < 0)
1030:                 silo_fatal("Could not get RAID disk "
1031:                            "info for disk %d\n", i);
1032:             if(md_disk_info.majorno != 0 && md_disk_info.minorno != 0) {
1033:                 d = get_device (md_disk_info.majorno, md_disk_info.minorno);
1034:                 if (md_disk_info.state == MD_DISK_FAULTY) {
1035:                     printf ("disk %s marked as faulty, skipping\n", d->dev);
1036:                     continue;
1037:                 }
1038:                 if (hwdev)
1039:                     last->next = d;
1040:                 else
1041:                     hwdev = d;
1042:                 while (d->next != NULL) d = d->next;
1043:                 last = d;
1044:             }
1045:         }
1046:         if (!hwdev)
1047:             silo_fatal("No non-faulty disks found "
1048:                        "in RAID1");
1049:         for (d = hwdev; d; d = d->next)
1050:             d->id = id++;
1051:         raid1 = id;
1052:         close (md_fd);
1053:         return hwdev;
1054:     }
----------------------------------------------------------------------------

The 'md_disk_info' structure declared on line 1007 is used uninitialised in the 'if' statement on line 1025.
And because md_array_info.nr_disks = 1 in my case of a degraded RAID1, SILO leaves the loop and goes directly to lines 1046 and 1047, where the aforementioned error message is printed.

Because the meaning of the 'if' on line 1025 was unclear to me, I simply commented it out, but still got the same result. After some investigation I found that md_array_info.nr_disks = 1 is the number of good disks in the array. And since my HDD is the second disk in the array from SILO's point of view, it couldn't be found by the search loop (lines 1024-1045). I also discovered experimentally that the total number of disks in the array (both good and bad) seems to be stored in 'md_array_info.raid_disks'. I replaced 'md_array_info.nr_disks' with 'md_array_info.raid_disks' on line 1024, and SILO installed the bootloader successfully.

So, that's all folks! A patch with all of my changes to silo.c is attached. It works for me :)

Best regards,
Dmitry 'MAD' Artamonow
--- silo-1.4.13/silo/silo.c	2006-06-01 21:24:53.000000000 +0400
+++ silo-1.4.13-mad/silo/silo.c	2007-05-04 19:43:12.000000000 +0400
@@ -1021,10 +1021,7 @@
             silo_fatal("Only RAID1 supported");
         hwdev = NULL;
         last = NULL;
-        for (i = 0; i < md_array_info.nr_disks; i++) {
-            if (i == md_array_info.nr_disks - 1 && md_disk_info.majorno == 0 &&
-                md_disk_info.minorno == 0)
-                break; // That's all folks
+        for (i = 0; i < md_array_info.raid_disks; i++) {
             md_disk_info.number = i;
             if (ioctl (md_fd, GET_DISK_INFO, &md_disk_info) < 0)
                 silo_fatal("Could not get RAID disk "