A little update on the situation:

After uninstalling mdadm 2.6.7.1, which ships with Ubuntu 9.04, and
installing mdadm 3.0, I got this:

root@Adam:~# cat /proc/mdstat
Personalities :
unused devices: <none>

I'm guessing that happened because initramfs-tools was removed when
uninstalling the old mdadm. No problem, I'll just assemble the array on
boot, through a line in /etc/rc.local (a rough sketch of what I have in
mind follows the assembly output below).

I then proceeded to assemble the array, but it refused:

root@Adam:~# mdadm -Af --verbose /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sdi5: Device or resource busy
mdadm: /dev/sdi5 has wrong uuid.
mdadm: no recogniseable superblock on /dev/sdi2
mdadm: /dev/sdi2 has wrong uuid.
mdadm: cannot open device /dev/sdi1: Device or resource busy
mdadm: /dev/sdi1 has wrong uuid.
mdadm: cannot open device /dev/sdi: Device or resource busy
mdadm: /dev/sdi has wrong uuid.
mdadm: no RAID superblock on /dev/sdh
mdadm: /dev/sdh has wrong uuid.
mdadm: superblock on /dev/sdg1 doesn't match others - assembly aborted

Since sdg1 has flunked out before, I just zeroed its superblock so I can
add it back later, if it isn't dead:

root@Adam:~# mdadm --zero-superblock /dev/sdg
mdadm: Unrecognised md component device - /dev/sdg
root@Adam:~# mdadm --zero-superblock /dev/sdg1
root@Adam:~# mdadm --zero-superblock /dev/sdg1
mdadm: Unrecognised md component device - /dev/sdg1

The array assembled properly after that (with 7 out of 8 disks -- running
degraded):

root@Adam:~# mdadm -Af --verbose /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sdi5: Device or resource busy
mdadm: /dev/sdi5 has wrong uuid.
mdadm: no recogniseable superblock on /dev/sdi2
mdadm: /dev/sdi2 has wrong uuid.
mdadm: cannot open device /dev/sdi1: Device or resource busy
mdadm: /dev/sdi1 has wrong uuid.
mdadm: cannot open device /dev/sdi: Device or resource busy
mdadm: /dev/sdi has wrong uuid.
mdadm: no RAID superblock on /dev/sdh
mdadm: /dev/sdh has wrong uuid.
mdadm: no RAID superblock on /dev/sdg1
mdadm: /dev/sdg1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: no RAID superblock on /dev/sdf
mdadm: /dev/sdf has wrong uuid.
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sde has wrong uuid.
mdadm: no RAID superblock on /dev/sdd
mdadm: /dev/sdd has wrong uuid.
mdadm: no RAID superblock on /dev/sdc
mdadm: /dev/sdc has wrong uuid.
mdadm: no RAID superblock on /dev/sdb
mdadm: /dev/sdb has wrong uuid.
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 7.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 4.
mdadm: added /dev/sde1 to /dev/md0 as 1
mdadm: added /dev/sdc1 to /dev/md0 as 2
mdadm: no uptodate device for slot 3 of /dev/md0
mdadm: added /dev/sda1 to /dev/md0 as 4
mdadm: added /dev/sdh1 to /dev/md0 as 5
mdadm: added /dev/sdb1 to /dev/md0 as 6
mdadm: added /dev/sdd1 to /dev/md0 as 7
mdadm: added /dev/sdf1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 7 drives (out of 8).
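
For reference, the /etc/rc.local addition I have in mind is roughly the
following -- just a sketch; /storage is only a placeholder for the real
mount point, and the lines would go before rc.local's final "exit 0":

# assemble the RAID array (same assemble command as above, without -f)
mdadm --assemble /dev/md0
# then mount it (/storage is a placeholder mount point)
mount /dev/md0 /storage

Anyway, this is what the array looks like now: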
root@Adam:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdf1[0] sdd1[7] sdb1[6] sdh1[5] sda1[4] sdc1[2] sde1[1]
      6837318656 blocks level 5, 256k chunk, algorithm 2 [8/7] [UUU_UUUU]

unused devices: <none>

After some poking, I'm suspecting the MSI motherboard itself, since the
problems happen on disks that sit on ports 7 and 8 of the motherboard,
and those two ports have their own controller and share a single bus.

I've ordered an EVGA motherboard that should arrive in a week or so.
I'll update later when I move the hard disks to it and add that sdg disk
(the checks I plan to run on sdg before re-adding it are sketched in the
P.S. at the bottom of this mail).

Thanks again Neil for your help :)

On Mon, Sep 7, 2009 at 3:44 AM, Majed B.<majedb@xxxxxxxxx> wrote:
> Thanks a lot Neil for your help :)
>
> The kernel logs showed a SATA link error for sdg. I double-checked the
> cables and they were more than fine. The array had been running for
> weeks before I did the reshaping, and no errors were reported before
> the reshaping process.
>
> I'm using an MSI motherboard (MS-7514) and have been having random
> issues with it since reaching 6 disks. I've recently ordered an EVGA
> motherboard, and if things turn out to be stable on it, I'll ditch MSI
> for good.
>
> While searching over the past 6 days, I noticed people complaining
> about acpi and apic causing issues, so I turned them off and will see
> how things turn out.
>
> These are the hard disks I'm using:
>
> root@Adam:~# hddtemp /dev/sd[a-h]
> /dev/sda: WDC WD10EACS-00D6B1: 26°C
> /dev/sdb: WDC WD10EACS-00D6B1: 28°C
> /dev/sdc: WDC WD10EACS-00ZJB0: 29°C
> /dev/sdd: WDC WD10EADS-65L5B1: 27°C
> /dev/sde: WDC WD10EADS-65L5B1: 28°C
> /dev/sdf: MAXTOR STM31000340AS: 28°C
> /dev/sdg: WDC WD10EACS-00ZJB0: 26°C
> /dev/sdh: WDC WD10EADS-00L5B1: 25°C
> /dev/sdi: Hitachi HDS721680PLAT80: 32°C
>
> (sdi is the OS disk)
>
> Neil, do you suggest any particular tests/stress tests to put sdg
> through?
>
> I'll force a couple of short and long smartd tests on it, and have dd
> read the whole disk a couple of times to make sure all sectors are
> read properly. Is that sufficient?
>
> Thank you again.
>
> On Mon, Sep 7, 2009 at 3:31 AM, NeilBrown<neilb@xxxxxxx> wrote:
>> On Mon, September 7, 2009 10:01 am, Majed B. wrote:
>>> I have installed mdadm 3.0 and ran -Af and now it's continuing
>>> reshaping!!!
>>
>> Excellent.
>>
>> Based on the --examine info you provided, it appears that
>> /dev/sdg1 reported an error at about 00:10:39 on Wednesday morning
>> and was evicted from the array. The reshape was up to 2435GB (37%)
>> at that point.
>> The reshape continued until 06:40:04 that morning, at which point it
>> had reached 3201GB (49%). At that point /dev/sdf1 seems to have
>> reported an error, so the whole array went offline.
>>
>> When you reassembled with mdadm-3.0 and --force, it excluded sdg1,
>> as that was the oldest, marked sdf1 as up to date, and continued.
>>
>> The reshape process will have redone the last few chunks, so all
>> the data will have been properly relocated.
>>
>> As all the superblocks report that the array was "State : clean",
>> you can be quite sure that all your data is safe (if they were
>> "State : active" there would be a small chance that a block or two
>> was corrupted, and an fsck etc. would be advised).
>>
>> It wouldn't hurt to examine your kernel logs to see what sort of
>> error was triggered at those two times, in case there might be a
>> need to replace a device.
>>
>>
>>> sdg1 is not in the list. Is that correct?! sdg1 was one of the
>>> array's disks before expanding.
>>> So I guess now the array is degraded yet is reshaping as if it had
>>> 8 disks, correct?
>>
>> Yes, that is correct.
>> It may be that sdg has a transient error, or it may have a serious
>> media or other error. You should convince yourself that it is working
>> reliably before adding it back into the array.
>>
>>> So after the reshaping process is over, I can add sdg1 again and it
>>> will resync properly, right?
>>
>> Yes it will, provided no write errors occur while writing data to it.
>>
>> NeilBrown
>>
>
> --
> Majed B.

--
Majed B.
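
P.S. For the curious, this is roughly what I plan to put sdg through
before re-adding it -- only a sketch of the plan (using smartctl for the
SMART self-tests), and it assumes the disk is still /dev/sdg on the new
board:

# short and long SMART self-tests (the long one takes a few hours;
# progress and results show up under "smartctl -a")
smartctl -t short /dev/sdg
smartctl -t long /dev/sdg
smartctl -a /dev/sdg

# read the whole disk a couple of times to make sure every sector is readable
dd if=/dev/sdg of=/dev/null bs=1M
dd if=/dev/sdg of=/dev/null bs=1M

# if everything looks clean (and the reshape is done), add it back to the array
mdadm --add /dev/md0 /dev/sdg1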