I've received two replacement drives and added them to the array. One of
them finished synchronizing and became an active member. The other, sdf,
has been treated as a spare. After running a smartctl test on each of the
drives, I found that sde has errors, preventing the sync process from
making sdf an active member. I have tried a couple of recommendations I
read on various sites, such as stopping the array and recreating it with
the "--assume-clean" option (not possible because a process is using the
array) and growing the array one disk larger (not possible because this
is RAID 10). Should I try to repair the bad blocks or is there a way to
force sde and sdf to sync first?

[root@localhost ~]# smartctl -l selftest /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.6.10-4.fc18.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure        90%       11822        1187144704
# 2  Short offline       Completed: read failure        90%       11814        1187144704
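In case it helps to see what I have in mind, this is roughly the sequence
I was considering for the "repair the bad blocks" route, based on the
approach I've seen suggested of forcing the drive to reallocate a failing
sector by writing to it. I have not run any of it yet, and I'd appreciate
a sanity check before I overwrite anything, since I'm only assuming the
LBA from the SMART self-test log falls inside the sde2 member:

# check the current rebuild/spare state before touching anything
cat /proc/mdstat
mdadm --detail /dev/md0

# confirm the sector reported by the self-test log really is unreadable
hdparm --read-sector 1187144704 /dev/sde

# if it is, overwrite that one sector so the drive reallocates it and the
# rebuild onto sdf can read past it (accepting the loss of that sector)
hdparm --write-sector 1187144704 --yes-i-know-what-i-am-doing /dev/sde

If forcing sde and sdf to sync first is the better option instead, I'm
not sure whether re-adding sdf2 or some other step is needed, so I'd
rather ask before trying anything.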
On Tue, Sep 23, 2014 at 10:07 AM, Ian Young <ian@xxxxxxxxxxxxxxx> wrote:
> I booted from a live CD so I could use version 3.1.10 of xfs_repair
> (versions < 3.1.8 reportedly have a bug when using ag_stride), then
> ran the following command:
>
> xfs_repair -P -o bhash=16384 -o ihash=16384 -o ag_stride=16 /dev/mapper/vg_raid10-srv
>
> It stopped after a few seconds, saying:
>
> xfs_repair: read failed: Input/output error
> XFS: failed to find log head
> zero_log: cannot find log head/tail (xlog_find_tail=5), zeroing it anyway
> xfs_repair: libxfs_device_zero write failed: Input/output error
>
> However, I was able to mount the volume after that and my data was
> still there! Thanks for pointing me in the right direction with the
> RAID.
>
> On Mon, Sep 22, 2014 at 5:55 PM, Ian Young <ian@xxxxxxxxxxxxxxx> wrote:
>> It's XFS. I'm running:
>>
>> xfs_repair -n /dev/mapper/vg_raid10-srv
>>
>> I expect it will take hours or days as this volume is 8.15 TiB.
>>
>> On Mon, Sep 22, 2014 at 4:53 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> On Mon, 22 Sep 2014 10:17:46 -0700 Ian Young <ian@xxxxxxxxxxxxxxx> wrote:
>>>
>>>> I forced the three good disks and the one that was behind by two
>>>> events to assemble:
>>>>
>>>> mdadm --assemble --force /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sde2
>>>>
>>>> Then I added the other two disks and let it sync overnight:
>>>>
>>>> mdadm --add --force /dev/md0 /dev/sdd2
>>>> mdadm --add --force /dev/md0 /dev/sdf2
>>>>
>>>> I rebooted the system in recovery mode and the root filesystem is
>>>> back! However, / is read-only and my /srv partition, which is the
>>>> largest and has most of my data, can't mount. When I try to examine
>>>> the array, it says "no md superblock detected on /dev/md0." On top of
>>>> the software RAID, I have four logical volumes. Here is the full LVM
>>>> configuration:
>>>>
>>>> http://pastebin.com/gzdZq5DL
>>>>
>>>> How do I recover the superblock?
>>>
>>> What sort of filesystem is it? ext4??
>>>
>>> Try "fsck -n" and see if it finds anything.
>>>
>>> The fact that LVM found everything suggests that the array is mostly
>>> working. Maybe just one superblock got corrupted somehow. If 'fsck'
>>> doesn't get you anywhere you might need to ask on a forum dedicated
>>> to the particular filesystem.
>>>
>>> NeilBrown
>>>
>>>
>>>>
>>>> On Sun, Sep 21, 2014 at 10:47 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>> > On Sun, 21 Sep 2014 22:32:19 -0700 Ian Young <ian@xxxxxxxxxxxxxxx> wrote:
>>>> >
>>>> >> My 6-drive software RAID 10 array failed. The individual drives
>>>> >> failed one at a time over the past few months but it's been an
>>>> >> extremely busy summer and I didn't have the free time to RMA the
>>>> >> drives and rebuild the array. Now I'm wishing I had acted sooner
>>>> >> because three of the drives are marked as removed and the array
>>>> >> doesn't have enough mirrors to start. I followed the recovery
>>>> >> instructions at raid.wiki.kernel.org and, before making things any
>>>> >> worse, saved the status using mdadm --examine and consulted this
>>>> >> mailing list. Here's the status:
>>>> >>
>>>> >> http://pastebin.com/KkV8e8Gq
>>>> >>
>>>> >> I can see that the event counts on sdd2 and sdf2 are significantly
>>>> >> far behind, so we can consider that data too old. sdc2 is only
>>>> >> behind by two events, so any data loss there should be minimal. If
>>>> >> I can make the array start with sd[abce]2 I think that will be
>>>> >> enough to mount the filesystem, back up my data, and start
>>>> >> replacing drives. How do I do that?
>>>> >
>>>> > Use the "--force" option with "--assemble".
>>>> >
>>>> > NeilBrown
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html