Re: RAID5: failing an active component during spare rebuild - arrays hangs

Alexander Lyakas <alex.bolshoy@xxxxxxxxx> · Tue, 21 Jun 2011 11:05:09 +0300



Anyone???...

On Mon, Jun 6, 2011 at 9:19 PM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>
> Hello,
>
> the kernel version is:
>
> root@ubuntu:~# uname -a
> Linux ubuntu 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> mdadm version is:
> root@ubuntu:~# mdadm -V
> mdadm - v3.1.4 - 31st August 2010
>
> Examining the three array components:
>
> root@ubuntu:~# mdadm -E /dev/sd{a,b,c}
> /dev/sda:
>          Magic : a92b4efc
>        Version : 1.2
>    Feature Map : 0x1
>     Array UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
>           Name : vc:zvp_1123
>  Creation Time : Mon Jun  6 21:10:38 2011
>     Raid Level : raid5
>   Raid Devices : 3
>
>  Avail Dev Size : 41940992 (20.00 GiB 21.47 GB)
>     Array Size : 83879936 (40.00 GiB 42.95 GB)
>  Used Dev Size : 41939968 (20.00 GiB 21.47 GB)
>    Data Offset : 2048 sectors
>   Super Offset : 8 sectors
>          State : active
>    Device UUID : 8db90071:be80216e:09468262:1f5046b1
>
> Internal Bitmap : 8 sectors from superblock
>    Update Time : Mon Jun  6 21:10:46 2011
>       Checksum : 2e424556 - correct
>         Events : 10
>
>         Layout : left-symmetric
>     Chunk Size : 512K
>
>   Device Role : Active device 0
>   Array State : A.A ('A' == active, '.' == missing)
> /dev/sdb:
>          Magic : a92b4efc
>        Version : 1.2
>    Feature Map : 0x1
>     Array UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
>           Name : vc:zvp_1123
>  Creation Time : Mon Jun  6 21:10:38 2011
>     Raid Level : raid5
>   Raid Devices : 3
>
>  Avail Dev Size : 41940992 (20.00 GiB 21.47 GB)
>     Array Size : 83879936 (40.00 GiB 42.95 GB)
>  Used Dev Size : 41939968 (20.00 GiB 21.47 GB)
>    Data Offset : 2048 sectors
>   Super Offset : 8 sectors
>          State : clean
>    Device UUID : 9f41313b:b1aa70f8:6cf0ca2f:c6ea0a64
>
> Internal Bitmap : 8 sectors from superblock
>    Update Time : Mon Jun  6 21:10:44 2011
>       Checksum : 2d23c61 - correct
>         Events : 8
>
>         Layout : left-symmetric
>     Chunk Size : 512K
>
>   Device Role : Active device 1
>   Array State : AAA ('A' == active, '.' == missing)
> /dev/sdc:
>          Magic : a92b4efc
>        Version : 1.2
>    Feature Map : 0x3
>     Array UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
>           Name : vc:zvp_1123
>  Creation Time : Mon Jun  6 21:10:38 2011
>     Raid Level : raid5
>   Raid Devices : 3
>
>  Avail Dev Size : 41940992 (20.00 GiB 21.47 GB)
>     Array Size : 83879936 (40.00 GiB 42.95 GB)
>  Used Dev Size : 41939968 (20.00 GiB 21.47 GB)
>    Data Offset : 2048 sectors
>   Super Offset : 8 sectors
> Recovery Offset : 999424 sectors
>          State : active
>    Device UUID : 61189a9d:ec082cea:a3ba32fb:800fe84b
>
> Internal Bitmap : 8 sectors from superblock
>    Update Time : Mon Jun  6 21:10:46 2011
>       Checksum : a47a059 - correct
>         Events : 10
>
>         Layout : left-symmetric
>     Chunk Size : 512K
>
>   Device Role : Active device 2
>   Array State : A.A ('A' == active, '.' == missing)
>
> Details about the array:
>
> root@ubuntu:~#  mdadm -Q --detail /dev/md1123
> /dev/md1123:
>        Version : 1.2
>  Creation Time : Mon Jun  6 21:10:38 2011
>     Raid Level : raid5
>     Array Size : 41939968 (40.00 GiB 42.95 GB)
>  Used Dev Size : 20969984 (20.00 GiB 21.47 GB)
>   Raid Devices : 3
>  Total Devices : 3
>    Persistence : Superblock is persistent
>
>  Intent Bitmap : Internal
>
>    Update Time : Mon Jun  6 21:10:46 2011
>          State : active, FAILED
>  Active Devices : 1
> Working Devices : 2
>  Failed Devices : 1
>  Spare Devices : 1
>
>         Layout : left-symmetric
>     Chunk Size : 512K
>
>           Name : vc:zvp_1123
>           UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
>         Events : 10
>
>    Number   Major   Minor   RaidDevice State
>       0       8        0        0      active sync   /dev/sda
>       1       8       16        1      faulty spare rebuilding   /dev/sdb
>       3       8       32        2      spare rebuilding   /dev/sdc
>
>
> Basically, the problem is that neither the faulty component nor the
> rebuilding spare is kicked out of the array, and the array is stuck in
> this state.
>
> Thanks,
>  Alex.
>
>
> 2011/6/6 Nagilum <nagilum@xxxxxxxxxxx>:
> > Make sure you provide all relevant details such as kernel version, mdadm
> > version and maybe also mdadm -E /dev/sd{a,b,c}, mdadm -Q --detail /dev/md0,
> > ..
> >
> > ----- Message from alex.bolshoy@xxxxxxxxx ---------
> >    Date: Sun, 5 Jun 2011 22:41:55 +0300
> >    From: Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> >  Subject: RAID5: failing an active component during spare rebuild - arrays
> > hangs
> >      To: linux-raid@xxxxxxxxxxxxxxx
> >
> >
> >> Hello everybody,
> >> I am testing a scenario, in which I create a RAID5 with three devices:
> >> /dev/sd{a,b,c}. Since I don't supply --force to mdadm during creation,
> >> it treats the array as degraded and starts rebuilding the sdc as a
> >> spare. This is as documented.
> >>
> >> Then I do --fail on /dev/sda. I understand that at this point my data
> >> is gone, but I think I should still be able to tear down the array.
> >>
> >> Sometimes I see that /dev/sda is kicked from the array as faulty, and
> >> /dev/sdc is also removed and marked as a spare. Then I am able to tear
> >> down the array.
> >>
> >> But sometimes, it looks like the system hits some kind of a deadlock.
> >> mdadm --detail produces:
> >>
> >>     Update Time : Sun Jun  5 21:54:34 2011
> >>           State : active, FAILED
> >>  Active Devices : 1
> >> Working Devices : 2
> >>  Failed Devices : 1
> >>   Spare Devices : 1
> >>
> >>          Layout : left-symmetric
> >>      Chunk Size : 512K
> >>
> >>            Name : ubuntu:zvp_1123
> >>            UUID : 48a15fb6:b6410bb9:a2ca173e:0092032c
> >>          Events : 67
> >>
> >>     Number   Major   Minor   RaidDevice State
> >>        0       8        0        0      faulty spare rebuilding   /dev/sda
> >>        1       8       16        1      active sync   /dev/sdb
> >>        3       8       32        2      spare rebuilding   /dev/sdc
> >>
> >> So the faulty device and the spare are not kicked out of the array. At
> >> this point I am unable to do anything with the array:
> >>
> >> root@ubuntu:~# sudo mdadm --stop /dev/md1123
> >> mdadm: failed to stop array /dev/md1123: Device or resource busy
> >> Perhaps a running process, mounted filesystem or active volume group?
> >> root@ubuntu:~# sudo mdadm /dev/md1123 --remove /dev/sda
> >> mdadm: hot remove failed for /dev/sda: Device or resource busy
> >> root@ubuntu:~# sudo mdadm /dev/md1123 --remove /dev/sdb
> >> mdadm: hot remove failed for /dev/sdb: Device or resource busy
> >> root@ubuntu:~# sudo mdadm /dev/md1123 --remove /dev/sdc
> >> mdadm: hot remove failed for /dev/sdc: Device or resource busy
> >>
> >> This is happening on ubuntu-natty, with mdadm - v3.1.4 - 31st August 2010.
> >> Looking at some code in mdadm/Detail.c, it looks like /dev/sda has
> >> been marked only as MD_DISK_FAULTY, but has not yet been kicked out of
> >> the array. The "spare" and "rebuilding" prints also result from that.
> >>
> >> The same thing also happens (sometimes) when I manually initiate a resync
> >> (by writing 'repair' to 'sync_action') and later manually fail one
> >> of the devices. Then I also saw messages like this in the syslog:
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350454] INFO: task md1123_resync:7993 blocked for more than 120 seconds.
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350644] md1123_resync   D 0000000000000000     0  7993      2 0x00000004
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350647]  ffff8800b56b1cd0 0000000000000046 ffff8800b56b1fd8 ffff8800b56b0000
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350649]  0000000000013d00 ffff880036c09a98 ffff8800b56b1fd8 0000000000013d00
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350652]  ffff8800b7f1adc0 ffff880036c096e0 ffff8800b56b1cb0 ffff880036c56610
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350654] Call Trace:
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350657]  [<ffffffff81492885>] md_do_sync+0xb45/0xc90
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350660]  [<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350663]  [<ffffffff8107861b>] ? recalc_sigpending+0x1b/0x50
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350665]  [<ffffffff8148c516>] md_thread+0x116/0x150
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350667]  [<ffffffff8148c400>] ? md_thread+0x0/0x150
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350669]  [<ffffffff810871f6>] kthread+0x96/0xa0
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350672]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350674]  [<ffffffff81087160>] ? kthread+0x0/0xa0
> >> Jun  5 21:42:00 ubuntu kernel: [ 2280.350676]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
> >>
> >> This is pretty easy for me to reproduce.
> >>
> >> Basically, I would like to know what the user is expected to do when
> >> more than one RAID5 array component fails during rebuild/resync.
> >>
> >> Thanks,
> >>   Alex.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> >
> > ----- End message from alex.bolshoy@xxxxxxxxx -----
> >
> >
> >
> > ========================================================================
> > #    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
> > #   / |/ /__ ____ _(_) /_ ____ _  nagilum@xxxxxxxxxxx \n +491776461165 #
> > #  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
> > # /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
> > #           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
> > ========================================================================
> >
> >
> > ----------------------------------------------------------------
> > cakebox.homeunix.net - all the machine one needs..
> >
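
For anyone trying to spot the same stuck state from a script, here is a
minimal sketch. It assumes the device table printed by "mdadm --detail"
has the layout quoted above, and it flags members whose state contains
both "faulty" and "rebuilding", which is the combination shown for the
failed component in both reports. The function name find_stuck_members is
hypothetical and not part of mdadm; this only detects the symptom and does
nothing to recover the array.

```shell
#!/bin/sh
# find_stuck_members: hypothetical helper, NOT part of mdadm.
# Reads `mdadm --detail` output on stdin and prints the device path of
# every member whose state contains both "faulty" and "rebuilding".
# Assumption: the device table layout matches the one quoted in this thread.
find_stuck_members() {
    awk '
        # The device table starts right after this header line.
        $1 == "Number" && $4 == "RaidDevice" { in_table = 1; next }
        in_table && NF >= 6 {
            # Fields 5..NF-1 are the state words; the last field is the device.
            state = ""
            for (i = 5; i < NF; i++) state = state " " $i
            if (state ~ /faulty/ && state ~ /rebuilding/) print $NF
        }
    '
}

# Sample copied from the device table in the follow-up report.
sample='    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      faulty spare rebuilding   /dev/sdb
       3       8       32        2      spare rebuilding   /dev/sdc'

printf '%s\n' "$sample" | find_stuck_members
```

On a live system the same function could be fed directly, for example
"mdadm --detail /dev/md1123 | find_stuck_members".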