Re: RAID5: failing an active component during spare rebuild - array hangs

Hello Neil,
thank you for your response. I have since moved to stock Ubuntu Natty
11.04, but the problem still occurs. I have a simple script that
reproduces the issue for me in under a minute.
System details:
Linux ubuntu 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC
2011 x86_64 x86_64 x86_64 GNU/Linux

Here is the script:
##################################
#!/bin/bash

while true
do
	# Create a 3-disk RAID5. Without --force, md starts the array
	# degraded and begins rebuilding the third disk as a spare.
	mdadm --create /dev/md1123 --raid-devices=3 --level=5 \
		--bitmap=internal --name=1123 --run --auto=md --metadata=1.2 \
		--homehost=alex --verbose /dev/sda /dev/sdb /dev/sdc
	# Let the spare rebuild get going, then fail an active member.
	sleep 6
	mdadm --manage /dev/md1123 --fail /dev/sda
	sleep 1
	# Keep looping until --stop fails, i.e. the array is wedged.
	mdadm --stop /dev/md1123 || break
done
#####################################
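
While the script loops, the array state can be watched from a second
terminal; this is just a convenience, not part of the repro:

	watch -n1 cat /proc/mdstat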

Here is the output of one run. At the end, the --stop command fails,
and from that point on I am unable to do anything with the array other
than reboot the machine.

root@ubuntu:/mnt/work/alex# ./repro.sh
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:55:54 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:55:54 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdc appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:55:54 2011
mdadm: size set to 20969984K
mdadm: creation continuing despite oddities due to --run
mdadm: array /dev/md1123 started.
mdadm: set /dev/sda faulty in /dev/md1123
mdadm: stopped /dev/md1123
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:57:45 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:57:45 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdc appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:57:45 2011
mdadm: size set to 20969984K
mdadm: creation continuing despite oddities due to --run
mdadm: array /dev/md1123 started.
mdadm: set /dev/sda faulty in /dev/md1123
mdadm: stopped /dev/md1123
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:57:52 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:57:52 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdc appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sun Jun 26 20:57:52 2011
mdadm: size set to 20969984K
mdadm: creation continuing despite oddities due to --run
mdadm: array /dev/md1123 started.
mdadm: set /dev/sda faulty in /dev/md1123
mdadm: failed to stop array /dev/md1123: Device or resource busy
Perhaps a running process, mounted filesystem or active volume group?

At this point mdadm --detail produces:

/dev/md1123:
        Version : 1.2
  Creation Time : Sun Jun 26 20:57:59 2011
     Raid Level : raid5
     Array Size : 41939968 (40.00 GiB 42.95 GB)
  Used Dev Size : 20969984 (20.00 GiB 21.47 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jun 26 20:58:23 2011
          State : active, FAILED
 Active Devices : 1
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : alex:1123
           UUID : cd564563:94fecf52:5b3492d4:4530ecbc
         Events : 4

    Number   Major   Minor   RaidDevice State
       0       8        0        0      faulty spare rebuilding   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       3       8       32        2      spare rebuilding   /dev/sdc

Note that the faulty device is not kicked out of the array, as I would
expect it to be.
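
Normally I would clean up by kicking the faulty member manually, along
the lines below, but in this wedged state nothing short of a reboot
helps. The sysrq line is one way to gather the hung-task stack traces
Neil refers to further down (it assumes CONFIG_MAGIC_SYSRQ is enabled):

#####################################
# Attempt the usual manual cleanup of the faulty member.
mdadm --manage /dev/md1123 --remove /dev/sda

# Inspect the kernel's view of the array.
cat /proc/mdstat
cat /sys/block/md1123/md/array_state	# e.g. active, clean, ...
cat /sys/block/md1123/md/sync_action	# e.g. recover, resync, idle

# Dump stack traces of all blocked (D-state) tasks to the kernel log.
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
#####################################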

Thanks,
  Alex.

On Wed, Jun 22, 2011 at 5:54 AM, NeilBrown <neilb@xxxxxxx> wrote:
>
> On Sun, 5 Jun 2011 22:41:55 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
> > Hello everybody,
> > I am testing a scenario, in which I create a RAID5 with three devices:
> > /dev/sd{a,b,c}. Since I don't supply --force to mdadm during creation,
> > it treats the array as degraded and starts rebuilding /dev/sdc as a
> > spare. This is as documented.
> >
> > Then I do --fail on /dev/sda. I understand that at this point my data
> > is gone, but I think I should still be able to tear down the array.
> >
> > Sometimes I see that /dev/sda is kicked from the array as faulty, and
> > /dev/sdc is also removed and marked as a spare. Then I am able to tear
> > down the array.
> >
> > But sometimes, it looks like the system hits some kind of a deadlock.
>
> I cannot reproduce this, either on current mainline or 2.6.38.  I didn't try
> the particular Ubuntu kernel that you mentioned as I don't have any Ubuntu
> machines.
> It is unlikely that Ubuntu has broken something, but not impossible... are
> you able to compile a kernel.org kernel (preferably 2.6.39) and see if you
> can reproduce the problem?
>
> Also, can you provide a simple script that will reliably trigger the bug
> for you?
>
> I did:
>
> while : ; do mdadm -CR /dev/md0 -l5 -n3 /dev/sd[abc] ; sleep 5; mdadm /dev/md0 -f /dev/sda ; mdadm -Ss ; echo ; echo; done
>
> and it has no problems at all.
>
> Certainly a deadlock shouldn't be happening...
>
> From the stack trace you sent, it looks like it is probably hanging at
>
>        wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>
> which suggests that a resync request started and didn't complete.  I've
> never seen a hang there before.
>
> NeilBrown
>