md raid1 chokes when one disk is removed

Hello,

I am evaluating RHEL, prior to purchase for a new production network.
Our boxes are SuperMicro 6018HT with dual SATA drives.

I'd like to give these systems a bit of added resiliency with RAID1.
They have pairs of SATA disks, but no hardware RAID.  With FreeBSD, I
can set up a gmirror and have a RAID1 system.  (I have documentation on
that at
http://dannyman.toldme.com/2005/01/24/freebsd-howto-gmirror-system/ )
So, for Red Hat, I checked the manual and thought I'd give the software
RAID method a shot.

Here's a capture of my Disk Druid:
http://www.flickr.com/photos/dannyman/61643870/

And, here's some info from the running system:
[root@linux ~]# cat /etc/fstab 
# This file is edited by fstab-sync - see 'man fstab-sync' for details
/dev/md2                /                       ext3    defaults        1 1
/dev/md0                /boot                   ext3    defaults        1 2
none                    /dev/pts                devpts  gid=5,mode=620  0 0
none                    /dev/shm                tmpfs   defaults        0 0
none                    /proc                   proc    defaults        0 0
none                    /sys                    sysfs   defaults        0 0
/dev/md1                swap                    swap    defaults        0 0
/dev/hdc                /media/cdrom            auto    pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0 0
/dev/fd0                /media/floppy           auto    pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0 0
[root@linux ~]# mount
/dev/md2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/md0 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

[root@linux ~]# cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md2 super-minor=2
ARRAY /dev/md0 super-minor=0
ARRAY /dev/md1 super-minor=1
[root@linux ~]# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdb2[1] sda2[0]
      2032128 blocks [2/2] [UU]
      
md2 : active raid1 sdb3[1] sda3[0]
      76011456 blocks [2/2] [UU]
      
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
      
unused devices: <none>
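All three mirrors report [UU], meaning both members are up.  As an
aside, here's a little helper I sketched for watching that status from
cron (the check_mdstat name is my own invention; it takes a file
argument so it can be tested against a saved copy of /proc/mdstat):

```shell
#!/bin/sh
# check_mdstat: report OK if every array's status brackets look like
# [UU] (all members present), or DEGRADED if any bracket contains an
# underscore, e.g. [U_] for a mirror with one disk down.
check_mdstat() {
    file=${1:-/proc/mdstat}
    if grep -E '\[[0-9]+/[0-9]+\] \[[U_]*_[U_]*\]' "$file" >/dev/null 2>&1
    then
        echo DEGRADED
        return 1
    fi
    echo OK
}
```

Though, presumably, mdadm's own monitor mode (mdadm --monitor --scan)
combined with the MAILADDR line in mdadm.conf covers the same ground.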

Sweet!  I can "fail" a disk and remove it thus:
mdadm --fail /dev/md0 /dev/sdb1
mdadm --fail /dev/md1 /dev/sdb2
mdadm --fail /dev/md2 /dev/sdb3
[ ... physically remove disk, system is fine ... ]
[ ... put the disk back in, system is fine ... ]
mdadm --remove /dev/md0 /dev/sdb1
mdadm --add /dev/md0 /dev/sdb1
mdadm --remove /dev/md1 /dev/sdb2
mdadm --add /dev/md1 /dev/sdb2
mdadm --remove /dev/md2 /dev/sdb3
mdadm --add /dev/md2 /dev/sdb3
[ ... md2 does a rebuild, but /boot and <swap> are fine -- nice! ... ]
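Since the same commands repeat for each array, the whole swap can be
scripted.  Here's a dry-run sketch (it echoes the commands rather than
running them; the md-to-partition pairs are from my layout above, so
adjust to taste):

```shell
#!/bin/sh
# Dry run: print the mdadm commands to fail/remove, then re-add,
# every sdb member.  Drop the leading "echo" to run them for real.
PAIRS="md0:sdb1 md1:sdb2 md2:sdb3"

fail_and_remove() {
    for pair in $PAIRS; do
        md=${pair%%:*} part=${pair##*:}
        echo mdadm --fail   /dev/$md /dev/$part
        echo mdadm --remove /dev/$md /dev/$part
    done
}

re_add() {
    for pair in $PAIRS; do
        md=${pair%%:*} part=${pair##*:}
        echo mdadm --add /dev/$md /dev/$part
    done
}

fail_and_remove
# ... physically swap the disk here ...
re_add
```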

Okay, but what if a disk fails on its own?

[root@linux ~]# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdb2[1] sda2[0]
      2032128 blocks [2/2] [UU]
      
md2 : active raid1 sdb3[1] sda3[0]
      76011456 blocks [2/2] [UU]
      
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
      
unused devices: <none>
[ ... pull sdb ... ]
[root@linux ~]# cat /proc/mdstat
ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
ata1: status=0xd0 { Busy }
SCSI error : <0 0 1 0> return code = 0x8000002
Current sdb: sense key Aborted Command
Additional sense: Scsi parity error
end_request: I/O error, dev sdb, sector 156296202
md: write_disk_sb failed for device sdb3
ATA: abnormal status 0xD0 on port 0x1F7
md: errors occurred during superblock update, repeating
ATA: abnormal status 0xD0 on port 0x1F7
ATA: abnormal status 0xD0 on port 0x1F7
ata1: command 0x35 timeout, stat 0x50 host_stat 0x61
[ ... reinsert sdb ... ]
Personalities : [raid1] 
md1 : active raid1 sdb2[1] sda2[0]
      2032128 blocks [2/2] [UU]
      
md2 : active raid1 sdb3[1] sda3[0]
      76011456 blocks [2/2] [UU]
      
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
      
unused devices: <none>
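One thing I'd check after a surprise pull: when md actually kicks a
member, it shows up in /proc/mdstat with an "(F)" marker, e.g.
"sdb3[1](F)".  A little grep to list any such members (list_faulty is
my own name; the file argument lets it run against a saved copy):

```shell
#!/bin/sh
# list_faulty: print the device name of each array member marked
# faulty -- mdstat renders those as e.g. "sdb3[1](F)".
list_faulty() {
    file=${1:-/proc/mdstat}
    grep -o '[a-z0-9]*\[[0-9]*\](F)' "$file" | sed 's/\[.*//'
}
```

Here it prints nothing, since after the reinsert everything reads [UU]
again -- which is itself part of what puzzles me.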

I don't like that the system seems to choke when the disk is removed
unexpectedly.  Is this intended operation?  Do I need to massage my SCSI
subsystem a bit?  What's up? :)

Thanks for your time.

Sincerely,
-danny

-- 
http://dannyman.toldme.com/

-- 
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
