Re: Problems with raid after reboot.

On Mon, Jul 25, 2011 at 3:42 PM, Matthew Tice <mjtice@xxxxxxxxx> wrote:
> On Mon, Jul 25, 2011 at 3:33 PM, Matthew Tice <mjtice@xxxxxxxxx> wrote:
>> On Mon, Jul 25, 2011 at 3:30 PM, Matthew Tice <mjtice@xxxxxxxxx> wrote:
>>> On Mon, Jul 25, 2011 at 3:21 PM, Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
>>>> On Mon Jul 25, 2011 at 03:04:34PM -0600, Matthew Tice wrote:
>>>>
>>>>> Well, things are a lot different now - I'm unable to start the array
>>>>> successfully.  I removed an old, unrelated drive that was giving
>>>>> me SMART errors - when I rebooted, the drive assignments shifted (not
>>>>> sure this really matters, though).
>>>>>
>>>>> Now when I try to start the array I get:
>>>>>
>>>>> # mdadm -A -f /dev/md0
>>>>> mdadm: no devices found for /dev/md0
>>>>>
>>>>> I can nudge it slightly with auto-detect:
>>>>>
>>>>> # mdadm --auto-detect
>>>>>
>>>>> Then I try to assemble the array with:
>>>>>
>>>>> # mdadm -A -f /dev/md0 /dev/sd[bcde]
>>>>> mdadm: cannot open device /dev/sde: Device or resource busy
>>>>> mdadm: /dev/sde has no superblock - assembly aborted
>>>>>
>>>> <- SNIP ->
>>>>>  │  └─sde: [8:64] MD raid5 (none/4) 931.51g md_d0 inactive spare
>>>> <- SNIP ->
>>>>>
>>>>> I've looked but I'm unable to find where the drive is in use.
>>>>
>>>> lsdrv shows that it's in use in array md_d0 - presumably this is a
>>>> part-assembled array (possibly auto-assembled by the kernel). Try
>>>> stopping that first, then doing the "mdadm -A -f /dev/md0 /dev/sd[bcde]"
>>>>
>>>
>>> Nice catch, thanks, Robin.
>>>
>>> I stopped /dev/md_d0 then started the array on /dev/md0
>>>
>>> # mdadm -A -f /dev/md0 /dev/sd[bcde]
>>> mdadm: /dev/md0 has been started with 3 drives (out of 4).
>>>
>>> It's only seeing three of the four drives.  I ran a read-only fsck on
>>> it just in case, but it failed:
>>>
>>> # fsck -n /dev/md0
>>> fsck from util-linux-ng 2.17.2
>>> e2fsck 1.41.12 (17-May-2010)
>>> Superblock has an invalid journal (inode 8).
>>> Clear? no
>>>
>>> fsck.ext4: Illegal inode number while checking ext3 journal for /dev/md0
>>>
>>> Looks like /dev/sde is missing (as also noted above):
>>>
>>> # mdadm --detail /dev/md0
>>> /dev/md0:
>>>        Version : 00.90
>>>  Creation Time : Sat Mar 12 21:22:34 2011
>>>     Raid Level : raid5
>>>     Array Size : 2197723392 (2095.91 GiB 2250.47 GB)
>>>  Used Dev Size : 732574464 (698.64 GiB 750.16 GB)
>>>   Raid Devices : 4
>>>  Total Devices : 3
>>> Preferred Minor : 0
>>>    Persistence : Superblock is persistent
>>>
>>>    Update Time : Mon Jul 25 14:08:30 2011
>>>          State : clean, degraded
>>>  Active Devices : 3
>>> Working Devices : 3
>>>  Failed Devices : 0
>>>  Spare Devices : 0
>>>
>>>         Layout : left-symmetric
>>>     Chunk Size : 64K
>>>
>>>           UUID : daf06d5a:b80528b1:2e29483d:f114274d (local to host storage)
>>>         Events : 0.5593
>>>
>>>    Number   Major   Minor   RaidDevice State
>>>       0       0        0        0      removed
>>>       1       8       48        1      active sync   /dev/sdd
>>>       2       8       32        2      active sync   /dev/sdc
>>>       3       8       16        3      active sync   /dev/sdb
>>>
>>
>> One other strange thing I just noticed - /dev/sde keeps getting added
>> back into /dev/md_d0 (after I start the array on /dev/md0)
>>
>> # /usr/local/bin/lsdrv
>> **Warning** The following utility(ies) failed to execute:
>>  pvs
>>  lvs
>> Some information may be missing.
>>
>> PCI [ata_piix] 00:1f.1 IDE interface: Intel Corporation 82801G (ICH7
>> Family) IDE Controller (rev 01)
>>  ├─scsi 0:0:0:0 LITE-ON COMBO SOHC-4836K {2006061700044437}
>>  │  └─sr0: [11:0] Empty/Unknown 1.00g
>>  └─scsi 1:x:x:x [Empty]
>> PCI [ata_piix] 00:1f.2 IDE interface: Intel Corporation N10/ICH7
>> Family SATA IDE Controller (rev 01)
>>  ├─scsi 2:x:x:x [Empty]
>>  └─scsi 3:0:0:0 ATA HDS728080PLA380 {PFDB20S4SNLT6J}
>>    └─sda: [8:0] Partitioned (dos) 76.69g
>>       ├─sda1: [8:1] (ext4) 75.23g {960433b3-af56-41bd-bb9a-d0a0fb5ffb45}
>>       │  └─Mounted as
>> /dev/disk/by-uuid/960433b3-af56-41bd-bb9a-d0a0fb5ffb45 @ /
>>       ├─sda2: [8:2] Partitioned (dos) 1.00k
>>       └─sda5: [8:5] (swap) 1.46g {10c3b226-16d4-44ea-ad1e-6296bb92969d}
>> PCI [sata_sil24] 04:00.0 RAID bus controller: Silicon Image, Inc. SiI
>> 3132 Serial ATA Raid II Controller (rev 01)
>>  ├─scsi 4:0:0:0 ATA WDC WD7500AADS-0 {WD-WCAV59574584}
>>  │  └─sdb: [8:16] MD raid5 (3/4) 698.64g md0 clean in_sync
>> {daf06d5a-b805-28b1-2e29-483df114274d}
>>  │     └─md0: [9:0] (ext3) 2.05t {a9a38e8e-d54d-407d-a786-31410ad6e17d}
>>  ├─scsi 4:1:0:0 ATA WDC WD7500AADS-0 {WD-WCAV59459025}
>>  │  └─sdc: [8:32] MD raid5 (2/4) 698.64g md0 clean in_sync
>> {daf06d5a-b805-28b1-2e29-483df114274d}
>>  ├─scsi 4:2:0:0 ATA Hitachi HDS72101 {JP9911HZ1SKHNU}
>>  │  └─sdd: [8:48] MD raid5 (1/4) 931.51g md0 clean in_sync
>> {daf06d5a-b805-28b1-2e29-483df114274d}
>>  ├─scsi 4:3:0:0 ATA Hitachi HDS72101 {JP9960HZ1VK96U}
>>  │  └─sde: [8:64] MD raid5 (none/4) 931.51g md_d0 inactive spare
>> {daf06d5a-b805-28b1-2e29-483df114274d}
>>  │     └─md_d0: [254:0] Empty/Unknown 0.00k
>>  └─scsi 7:x:x:x [Empty]
>>
>
> Here is something interesting from syslog:
>
> 1. I stop /dev/md_d0
> Jul 25 15:38:56 localhost kernel: [ 4272.658244] md: md_d0 stopped.
> Jul 25 15:38:56 localhost kernel: [ 4272.658258] md: unbind<sde>
> Jul 25 15:38:56 localhost kernel: [ 4272.658271] md: export_rdev(sde)
>
> 2. I assemble /dev/md0 with:
> # mdadm -A /dev/md0 /dev/sd[bcde]
> mdadm: /dev/md0 has been started with 3 drives (out of 4).
>
> Jul 25 15:41:33 localhost kernel: [ 4429.537035] md: md0 stopped.
> Jul 25 15:41:33 localhost kernel: [ 4429.545447] md: bind<sde>
> Jul 25 15:41:33 localhost kernel: [ 4429.545644] md: bind<sdc>
> Jul 25 15:41:33 localhost kernel: [ 4429.545810] md: bind<sdb>
> Jul 25 15:41:33 localhost kernel: [ 4429.546827] md: bind<sdd>
> Jul 25 15:41:33 localhost kernel: [ 4429.546876] md: kicking non-fresh
> sde from array!
> Jul 25 15:41:33 localhost kernel: [ 4429.546883] md: unbind<sde>
> Jul 25 15:41:33 localhost kernel: [ 4429.546890] md: export_rdev(sde)
> Jul 25 15:41:33 localhost kernel: [ 4429.565035] md/raid:md0: device
> sdd operational as raid disk 1
> Jul 25 15:41:33 localhost kernel: [ 4429.565041] md/raid:md0: device
> sdb operational as raid disk 3
> Jul 25 15:41:33 localhost kernel: [ 4429.565045] md/raid:md0: device
> sdc operational as raid disk 2
> Jul 25 15:41:33 localhost kernel: [ 4429.565631] md/raid:md0: allocated 4222kB
> Jul 25 15:41:33 localhost kernel: [ 4429.573438] md/raid:md0: raid
> level 5 active with 3 out of 4 devices, algorithm 2
> Jul 25 15:41:33 localhost kernel: [ 4429.574754] RAID conf printout:
> Jul 25 15:41:33 localhost kernel: [ 4429.574757]  --- level:5 rd:4 wd:3
> Jul 25 15:41:33 localhost kernel: [ 4429.574761]  disk 1, o:1, dev:sdd
> Jul 25 15:41:33 localhost kernel: [ 4429.574765]  disk 2, o:1, dev:sdc
> Jul 25 15:41:33 localhost kernel: [ 4429.574768]  disk 3, o:1, dev:sdb
> Jul 25 15:41:33 localhost kernel: [ 4429.574863] md0: detected
> capacity change from 0 to 2250468753408
> Jul 25 15:41:33 localhost kernel: [ 4429.575092]  md0: unknown partition table
> Jul 25 15:41:33 localhost kernel: [ 4429.626140] md: bind<sde>
>
> So /dev/sde is "non-fresh" and gets kicked out (the "unknown partition
> table" line is against md0 itself, which is expected when the filesystem
> sits directly on the array).
>

Okay, I was able to add it back in by stopping the stray /dev/md_d0 and then:
# mdadm /dev/md0 --add /dev/sde
mdadm: re-added /dev/sde
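
(For completeness, the stop step there was just the standard one, i.e.
something along the lines of:

# mdadm --stop /dev/md_d0

which released /dev/sde so it could be re-added.)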

So now it's syncing:
# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Sat Mar 12 21:22:34 2011
     Raid Level : raid5
     Array Size : 2197723392 (2095.91 GiB 2250.47 GB)
  Used Dev Size : 732574464 (698.64 GiB 750.16 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Jul 25 15:52:29 2011
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 0% complete

           UUID : daf06d5a:b80528b1:2e29483d:f114274d (local to host storage)
         Events : 0.5599

    Number   Major   Minor   RaidDevice State
       4       8       64        0      spare rebuilding   /dev/sde
       1       8       48        1      active sync   /dev/sdd
       2       8       32        2      active sync   /dev/sdc
       3       8       16        3      active sync   /dev/sdb

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid5 sde[4] sdd[1] sdb[3] sdc[2]
      2197723392 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
      [>....................]  recovery =  0.4% (3470464/732574464)
finish=365.0min speed=33284K/sec

unused devices: <none>
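
(The rebuild is estimated at roughly six hours, so in the meantime progress
can be followed with something like:

# watch -n 60 cat /proc/mdstat

instead of re-running it by hand.)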


However, the filesystem is still failing fsck - so does the order of the
devices matter when I re-assemble the array?  I see conflicting answers
online.
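
As far as I can tell the v0.90 superblock on each member records its own
slot, so the order the devices are listed on the "mdadm -A" command line
shouldn't change the layout.  To double-check, the recorded roles can be
dumped with something like:

# mdadm --examine /dev/sd[bcde]

and the "this" line in each device's output compared against the
RaidDevice numbers shown by "mdadm --detail" above.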