Re: RAID5: failing an active component during spare rebuild - arrays hangs

Hello,

The kernel version is:

root@ubuntu:~# uname -a
Linux ubuntu 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

The mdadm version is:

root@ubuntu:~# mdadm -V
mdadm - v3.1.4 - 31st August 2010

Examining the three array components:

root@ubuntu:~# mdadm -E /dev/sd{a,b,c}
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
           Name : vc:zvp_1123
  Creation Time : Mon Jun  6 21:10:38 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 41940992 (20.00 GiB 21.47 GB)
     Array Size : 83879936 (40.00 GiB 42.95 GB)
  Used Dev Size : 41939968 (20.00 GiB 21.47 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 8db90071:be80216e:09468262:1f5046b1

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Jun  6 21:10:46 2011
       Checksum : 2e424556 - correct
         Events : 10

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : A.A ('A' == active, '.' == missing)
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
           Name : vc:zvp_1123
  Creation Time : Mon Jun  6 21:10:38 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 41940992 (20.00 GiB 21.47 GB)
     Array Size : 83879936 (40.00 GiB 42.95 GB)
  Used Dev Size : 41939968 (20.00 GiB 21.47 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 9f41313b:b1aa70f8:6cf0ca2f:c6ea0a64

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Jun  6 21:10:44 2011
       Checksum : 2d23c61 - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA ('A' == active, '.' == missing)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x3
     Array UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
           Name : vc:zvp_1123
  Creation Time : Mon Jun  6 21:10:38 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 41940992 (20.00 GiB 21.47 GB)
     Array Size : 83879936 (40.00 GiB 42.95 GB)
  Used Dev Size : 41939968 (20.00 GiB 21.47 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
Recovery Offset : 999424 sectors
          State : active
    Device UUID : 61189a9d:ec082cea:a3ba32fb:800fe84b

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Jun  6 21:10:46 2011
       Checksum : a47a059 - correct
         Events : 10

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : A.A ('A' == active, '.' == missing)

Details about the array:

root@ubuntu:~# mdadm -Q --detail /dev/md1123
/dev/md1123:
        Version : 1.2
  Creation Time : Mon Jun  6 21:10:38 2011
     Raid Level : raid5
     Array Size : 41939968 (40.00 GiB 42.95 GB)
  Used Dev Size : 20969984 (20.00 GiB 21.47 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Jun  6 21:10:46 2011
          State : active, FAILED
 Active Devices : 1
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : vc:zvp_1123
           UUID : b5802763:fd4790dd:ee8bdeb2:2418097f
         Events : 10

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      faulty spare rebuilding   /dev/sdb
       3       8       32        2      spare rebuilding   /dev/sdc
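
A note on the size figures, since -E and --detail print different numbers for the same array: they agree if -E counts 512-byte sectors and --detail counts KiB. That unit assumption is inferred from the numbers above, not taken from the mdadm documentation, and the helper name below is made up for illustration. A minimal sketch of the cross-check:

```python
# Cross-check of the sizes printed above.
# Assumption (inferred from the numbers, not from mdadm docs):
#   `mdadm -E` reports sizes in 512-byte sectors,
#   `mdadm --detail` reports sizes in KiB.
SECTOR_BYTES = 512
KIB_BYTES = 1024

examine_array_sectors = 83879936     # "Array Size" from mdadm -E
examine_used_dev_sectors = 41939968  # "Used Dev Size" from mdadm -E
detail_array_kib = 41939968          # "Array Size" from mdadm --detail
detail_used_dev_kib = 20969984       # "Used Dev Size" from mdadm --detail
raid_devices = 3


def raid5_capacity(per_device, n_devices):
    """Usable RAID5 capacity: one device's worth of space holds parity."""
    return (n_devices - 1) * per_device


examine_bytes = examine_array_sectors * SECTOR_BYTES
detail_bytes = detail_array_kib * KIB_BYTES

print(examine_bytes == detail_bytes)
print(raid5_capacity(examine_used_dev_sectors, raid_devices) == examine_array_sectors)
print(raid5_capacity(detail_used_dev_kib, raid_devices) == detail_array_kib)
```

Under that assumption both tools describe the same 42946527232-byte array, so the differing numbers are not themselves a sign of corruption.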


The problem is that neither the faulty component (/dev/sdb) nor the
rebuilding spare (/dev/sdc) is kicked out of the array, and the array
stays stuck in this state.
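
To make the stuck state easy to spot in a test script, here is a minimal sketch that scans the device table of `mdadm --detail` for devices that are marked faulty but still occupy a RaidDevice slot. The function name and the regular expression are my own and assume the table layout shown above; this is not part of mdadm and may need adjusting for other mdadm versions.

```python
import re

# One row of the device table printed by `mdadm --detail`, e.g.
#    1       8       16        1      faulty spare rebuilding   /dev/sdb
# Assumption: columns are Number, Major, Minor, RaidDevice, State, device path.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(-|\d+)\s+(.*?)\s+(/dev/\S+)\s*$")


def faulty_in_slot(detail_text):
    """Return (device, raid_slot, state) for faulty devices that still hold a slot."""
    found = []
    for line in detail_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and everything outside the table
        _number, _major, _minor, slot, state, dev = m.groups()
        if slot != "-" and "faulty" in state.split():
            found.append((dev, int(slot), state))
    return found


SAMPLE = """\
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      faulty spare rebuilding   /dev/sdb
       3       8       32        2      spare rebuilding   /dev/sdc
"""

for dev, slot, state in faulty_in_slot(SAMPLE):
    print(f"{dev} is still in slot {slot} with state '{state}'")
```

On the table above this reports only /dev/sdb; the rebuilding spare /dev/sdc is not flagged, because a spare that is rebuilding in its slot is also what a healthy recovery looks like.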

Thanks,
  Alex.


2011/6/6 Nagilum <nagilum@xxxxxxxxxxx>:
> Make sure you provide all relevant details such as kernel version, mdadm
> version and maybe also mdadm -E /dev/sd{a,b,c}, mdadm -Q --detail /dev/md0,
> ..
>
> ----- Message from alex.bolshoy@xxxxxxxxx ---------
>    Date: Sun, 5 Jun 2011 22:41:55 +0300
>    From: Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>  Subject: RAID5: failing an active component during spare rebuild - arrays
> hangs
>      To: linux-raid@xxxxxxxxxxxxxxx
>
>
>> Hello everybody,
>> I am testing a scenario, in which I create a RAID5 with three devices:
>> /dev/sd{a,b,c}. Since I don't supply --force to mdadm during creation,
>> it treats the array as degraded and starts rebuilding the sdc as a
>> spare. This is as documented.
>>
>> Then I do --fail on /dev/sda. I understand that at this point my data
>> is gone, but I think I should still be able to tear down the array.
>>
>> Sometimes I see that /dev/sda is kicked from the array as faulty, and
>> /dev/sdc is also removed and marked as a spare. Then I am able to tear
>> down the array.
>>
>> But sometimes, it looks like the system hits some kind of a deadlock.
>> mdadm --detail produces:
>>
>>     Update Time : Sun Jun  5 21:54:34 2011
>>           State : active, FAILED
>>  Active Devices : 1
>> Working Devices : 2
>>  Failed Devices : 1
>>   Spare Devices : 1
>>
>>          Layout : left-symmetric
>>      Chunk Size : 512K
>>
>>            Name : ubuntu:zvp_1123
>>            UUID : 48a15fb6:b6410bb9:a2ca173e:0092032c
>>          Events : 67
>>
>>     Number   Major   Minor   RaidDevice State
>>        0       8        0        0      faulty spare rebuilding   /dev/sda
>>        1       8       16        1      active sync   /dev/sdb
>>        3       8       32        2      spare rebuilding   /dev/sdc
>>
>> So the faulty device and the spare are not kicked out of the array. At
>> this point I am unable to do anything with the array:
>>
>> root@ubuntu:~# sudo mdadm --stop /dev/md1123
>> mdadm: failed to stop array /dev/md1123: Device or resource busy
>> Perhaps a running process, mounted filesystem or active volume group?
>> root@ubuntu:~# sudo mdadm /dev/md1123 --remove /dev/sda
>> mdadm: hot remove failed for /dev/sda: Device or resource busy
>> root@ubuntu:~# sudo mdadm /dev/md1123 --remove /dev/sdb
>> mdadm: hot remove failed for /dev/sdb: Device or resource busy
>> root@ubuntu:~# sudo mdadm /dev/md1123 --remove /dev/sdc
>> mdadm: hot remove failed for /dev/sdc: Device or resource busy
>>
>> This is happening on ubuntu-natty, with mdadm - v3.1.4 - 31st August 2010.
>> Looking at some code in mdadm/Detail.c, it looks like /dev/sda has
>> been marked only as MD_DISK_FAULTY, but has not yet been kicked out of
>> the array. The "spare" and "rebuilding" prints also result from that.
>>
>> The same thing also happens (sometimes) when I manually initiate a resync
>> (by writing 'repair' to 'sync_action') and later manually fail one
>> of the devices. I then also saw messages like this in the syslog:
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350454] INFO: task md1123_resync:7993 blocked for more than 120 seconds.
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350644] md1123_resync   D 0000000000000000     0  7993      2 0x00000004
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350647]  ffff8800b56b1cd0 0000000000000046 ffff8800b56b1fd8 ffff8800b56b0000
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350649]  0000000000013d00 ffff880036c09a98 ffff8800b56b1fd8 0000000000013d00
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350652]  ffff8800b7f1adc0 ffff880036c096e0 ffff8800b56b1cb0 ffff880036c56610
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350654] Call Trace:
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350657]  [<ffffffff81492885>] md_do_sync+0xb45/0xc90
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350660]  [<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350663]  [<ffffffff8107861b>] ? recalc_sigpending+0x1b/0x50
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350665]  [<ffffffff8148c516>] md_thread+0x116/0x150
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350667]  [<ffffffff8148c400>] ? md_thread+0x0/0x150
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350669]  [<ffffffff810871f6>] kthread+0x96/0xa0
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350672]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350674]  [<ffffffff81087160>] ? kthread+0x0/0xa0
>> Jun  5 21:42:00 ubuntu kernel: [ 2280.350676]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
>>
>> This is pretty easy for me to reproduce.
>>
>> Basically, I would like to know what the user is expected to do when
>> more than one RAID5 array component fails during rebuild/resync.
>>
>> Thanks,
>>   Alex.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
> ----- End message from alex.bolshoy@xxxxxxxxx -----
>
>
>
> ========================================================================
> #    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
> #   / |/ /__ ____ _(_) /_ ____ _  nagilum@xxxxxxxxxxx \n +491776461165 #
> #  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
> # /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
> #           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
> ========================================================================
>
>
> ----------------------------------------------------------------
> cakebox.homeunix.net - all the machine one needs..
>