Re: Problems with raid after reboot.

Matthew Tice <mjtice@xxxxxxxxx> · Mon, 25 Jul 2011 15:42:09 -0600

On Mon, Jul 25, 2011 at 3:33 PM, Matthew Tice <mjtice@xxxxxxxxx> wrote:
> On Mon, Jul 25, 2011 at 3:30 PM, Matthew Tice <mjtice@xxxxxxxxx> wrote:
>> On Mon, Jul 25, 2011 at 3:21 PM, Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
>>> On Mon Jul 25, 2011 at 03:04:34PM -0600, Matthew Tice wrote:
>>>
>>>> Well things are a lot different now - I'm unable to start the array
>>>> successfully.  I removed an older non-relevant drive that was giving
>>>> me smart errors - when I rebooted the drive assignments shifted (not
>>>> sure this really matters, though).
>>>>
>>>> Now when I try to start the array I get:
>>>>
>>>> # mdadm -A -f /dev/md0
>>>> mdadm: no devices found for /dev/md0
>>>>
>>>> I can nudge it slightly with auto-detect:
>>>>
>>>> # mdadm --auto-detect
>>>>
>>>> Then I try to assemble the array with:
>>>>
>>>> # mdadm -A -f /dev/md0 /dev/sd[bcde]
>>>> mdadm: cannot open device /dev/sde: Device or resource busy
>>>> mdadm: /dev/sde has no superblock - assembly aborted
>>>>
>>> <- SNIP ->
>>>>  │  └─sde: [8:64] MD raid5 (none/4) 931.51g md_d0 inactive spare
>>> <- SNIP ->
>>>>
>>>> I've looked but I'm unable to find where the drive is in use.
>>>
>>> lsdrv shows that it's in use in array md_d0 - presumably this is a
>>> part-assembled array (possibly auto-assembled by the kernel). Try
>>> stopping that first, then doing the "mdadm -A -f /dev/md0 /dev/sd[bcde]"
>>>
>>
>> Nice catch, thanks, Robin.
>>
>> I stopped /dev/md_d0 then started the array on /dev/md0
>>
>> # mdadm -A -f /dev/md0 /dev/sd[bcde]
>> mdadm: /dev/md0 has been started with 3 drives (out of 4).
>>
>> It's only seeing the three drives.  I did an fsck on it just in case
>> but it failed:
>>
>> # fsck -n /dev/md0
>> fsck from util-linux-ng 2.17.2
>> e2fsck 1.41.12 (17-May-2010)
>> Superblock has an invalid journal (inode 8).
>> Clear? no
>>
>> fsck.ext4: Illegal inode number while checking ext3 journal for /dev/md0
>>
>> Looks like /dev/sde is missing (as also noted above):
>>
>> # mdadm --detail /dev/md0
>> /dev/md0:
>>        Version : 00.90
>>  Creation Time : Sat Mar 12 21:22:34 2011
>>     Raid Level : raid5
>>     Array Size : 2197723392 (2095.91 GiB 2250.47 GB)
>>  Used Dev Size : 732574464 (698.64 GiB 750.16 GB)
>>   Raid Devices : 4
>>  Total Devices : 3
>> Preferred Minor : 0
>>    Persistence : Superblock is persistent
>>
>>    Update Time : Mon Jul 25 14:08:30 2011
>>          State : clean, degraded
>>  Active Devices : 3
>> Working Devices : 3
>>  Failed Devices : 0
>>  Spare Devices : 0
>>
>>         Layout : left-symmetric
>>     Chunk Size : 64K
>>
>>           UUID : daf06d5a:b80528b1:2e29483d:f114274d (local to host storage)
>>         Events : 0.5593
>>
>>    Number   Major   Minor   RaidDevice State
>>       0       0        0        0      removed
>>       1       8       48        1      active sync   /dev/sdd
>>       2       8       32        2      active sync   /dev/sdc
>>       3       8       16        3      active sync   /dev/sdb
>>
>
> One other strange thing I just noticed - /dev/sde keeps getting added
> back into /dev/md_d0 (after I start the array on /dev/md0)
>
> # /usr/local/bin/lsdrv
> **Warning** The following utility(ies) failed to execute:
>  pvs
>  lvs
> Some information may be missing.
>
> PCI [ata_piix] 00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
>  ├─scsi 0:0:0:0 LITE-ON COMBO SOHC-4836K {2006061700044437}
>  │  └─sr0: [11:0] Empty/Unknown 1.00g
>  └─scsi 1:x:x:x [Empty]
> PCI [ata_piix] 00:1f.2 IDE interface: Intel Corporation N10/ICH7 Family SATA IDE Controller (rev 01)
>  ├─scsi 2:x:x:x [Empty]
>  └─scsi 3:0:0:0 ATA HDS728080PLA380 {PFDB20S4SNLT6J}
>    └─sda: [8:0] Partitioned (dos) 76.69g
>       ├─sda1: [8:1] (ext4) 75.23g {960433b3-af56-41bd-bb9a-d0a0fb5ffb45}
>       │  └─Mounted as /dev/disk/by-uuid/960433b3-af56-41bd-bb9a-d0a0fb5ffb45 @ /
>       ├─sda2: [8:2] Partitioned (dos) 1.00k
>       └─sda5: [8:5] (swap) 1.46g {10c3b226-16d4-44ea-ad1e-6296bb92969d}
> PCI [sata_sil24] 04:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)
>  ├─scsi 4:0:0:0 ATA WDC WD7500AADS-0 {WD-WCAV59574584}
>  │  └─sdb: [8:16] MD raid5 (3/4) 698.64g md0 clean in_sync {daf06d5a-b805-28b1-2e29-483df114274d}
>  │     └─md0: [9:0] (ext3) 2.05t {a9a38e8e-d54d-407d-a786-31410ad6e17d}
>  ├─scsi 4:1:0:0 ATA WDC WD7500AADS-0 {WD-WCAV59459025}
>  │  └─sdc: [8:32] MD raid5 (2/4) 698.64g md0 clean in_sync {daf06d5a-b805-28b1-2e29-483df114274d}
>  ├─scsi 4:2:0:0 ATA Hitachi HDS72101 {JP9911HZ1SKHNU}
>  │  └─sdd: [8:48] MD raid5 (1/4) 931.51g md0 clean in_sync {daf06d5a-b805-28b1-2e29-483df114274d}
>  ├─scsi 4:3:0:0 ATA Hitachi HDS72101 {JP9960HZ1VK96U}
>  │  └─sde: [8:64] MD raid5 (none/4) 931.51g md_d0 inactive spare {daf06d5a-b805-28b1-2e29-483df114274d}
>  │     └─md_d0: [254:0] Empty/Unknown 0.00k
>  └─scsi 7:x:x:x [Empty]
>

Here is something interesting from syslog:

1. I stop /dev/md_d0
Jul 25 15:38:56 localhost kernel: [ 4272.658244] md: md_d0 stopped.
Jul 25 15:38:56 localhost kernel: [ 4272.658258] md: unbind<sde>
Jul 25 15:38:56 localhost kernel: [ 4272.658271] md: export_rdev(sde)

2. I assemble /dev/md0 with:
# mdadm -A /dev/md0 /dev/sd[bcde]
mdadm: /dev/md0 has been started with 3 drives (out of 4).

Jul 25 15:41:33 localhost kernel: [ 4429.537035] md: md0 stopped.
Jul 25 15:41:33 localhost kernel: [ 4429.545447] md: bind<sde>
Jul 25 15:41:33 localhost kernel: [ 4429.545644] md: bind<sdc>
Jul 25 15:41:33 localhost kernel: [ 4429.545810] md: bind<sdb>
Jul 25 15:41:33 localhost kernel: [ 4429.546827] md: bind<sdd>
Jul 25 15:41:33 localhost kernel: [ 4429.546876] md: kicking non-fresh sde from array!
Jul 25 15:41:33 localhost kernel: [ 4429.546883] md: unbind<sde>
Jul 25 15:41:33 localhost kernel: [ 4429.546890] md: export_rdev(sde)
Jul 25 15:41:33 localhost kernel: [ 4429.565035] md/raid:md0: device sdd operational as raid disk 1
Jul 25 15:41:33 localhost kernel: [ 4429.565041] md/raid:md0: device sdb operational as raid disk 3
Jul 25 15:41:33 localhost kernel: [ 4429.565045] md/raid:md0: device sdc operational as raid disk 2
Jul 25 15:41:33 localhost kernel: [ 4429.565631] md/raid:md0: allocated 4222kB
Jul 25 15:41:33 localhost kernel: [ 4429.573438] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Jul 25 15:41:33 localhost kernel: [ 4429.574754] RAID conf printout:
Jul 25 15:41:33 localhost kernel: [ 4429.574757]  --- level:5 rd:4 wd:3
Jul 25 15:41:33 localhost kernel: [ 4429.574761]  disk 1, o:1, dev:sdd
Jul 25 15:41:33 localhost kernel: [ 4429.574765]  disk 2, o:1, dev:sdc
Jul 25 15:41:33 localhost kernel: [ 4429.574768]  disk 3, o:1, dev:sdb
Jul 25 15:41:33 localhost kernel: [ 4429.574863] md0: detected capacity change from 0 to 2250468753408
Jul 25 15:41:33 localhost kernel: [ 4429.575092]  md0: unknown partition table
Jul 25 15:41:33 localhost kernel: [ 4429.626140] md: bind<sde>
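
The log does not say which array that last bind<sde> went to. One way to
check might be to look the disk up in /proc/mdstat. Below is a rough
sketch; array_holding is a made-up helper name, not an mdadm command, and
it assumes the usual "mdX : state member[n] ..." line layout of
/proc/mdstat.

```shell
# Hypothetical helper: print the md array(s) whose member list contains
# the given disk. Reads /proc/mdstat-style text on stdin.
# $1 is the bare device name (e.g. sde), without the /dev/ prefix.
array_holding() {
    awk -v dev="$1" '
        # Array lines look like: md0 : active raid5 sdd[1] sdb[3] sdc[2]
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                name = $i
                sub(/\[.*/, "", name)   # strip "[n]" and flags like "(S)"
                if (name == dev) print $1
            }
        }
    '
}
```

Usage would be along the lines of: array_holding sde < /proc/mdstat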

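So /dev/sde gets kicked as "non-fresh" during assembly and is bound again
right after md0 starts (presumably back into md_d0, as noted above). The
"unknown partition table" message refers to md0 itself, not to /dev/sde.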
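
As far as I understand it, md calls a member "non-fresh" when the event
count in its superblock is behind the other members. A quick way to
compare them might be to pull the Events line for each disk out of
"mdadm --examine". This is only a sketch: events_by_device is a made-up
helper name, and it assumes each device's section in the --examine output
starts with a "/dev/...:" line and contains an "Events : N" line.

```shell
# Hypothetical helper: reduce "mdadm --examine" output (on stdin) to one
# "device events" pair per member, so stale members stand out.
events_by_device() {
    awk '
        # Section header, e.g. "/dev/sdb:" -> remember the device name.
        /^\/dev\// { sub(/:$/, "", $1); dev = $1 }
        # Superblock event counter, e.g. "Events : 0.5593".
        $1 == "Events" && $2 == ":" { print dev, $3 }
    '
}
```

Usage would be something like (as root):
mdadm --examine /dev/sd[bcde] | events_by_device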