Re: RAID1, changed disk, 2nd has errors ...

Robin Hill <robin@xxxxxxxxxxxxxxx> · Fri, 26 Aug 2011 15:08:10 +0100



On Fri Aug 26, 2011 at 03:51:17PM +0200, Stefan G. Weichinger wrote:

> On 26.08.2011 14:56, Robin Hill wrote:
> 
> > This would indicate that sdb has been reset as a spare, suggesting 
> > that the resync failed so it has left sda alone in the array (as 
> > failing it would destroy the array).
> 
> oh my ...
> 
> So the array is somehow split-brain now? Some sectors good here, some
> there??
> 
> Why is sda4 now flagged as (S)? Is it a spare or not?
> I don't fully understand the current state of the array ...
> 
sda4 is still in the array, with some unreadable sectors. sdb4 is a
spare because the resync failed due to unreadable sectors on sda4. You
cannot add a disk to an array unless the data can all be read (or
recovered if there's still enough redundancy).
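
To see that state for yourself, something along these lines should do.
It's only a rough sketch: it assumes the usual /proc/mdstat layout, where
a spare member carries an (S) suffix, and "spares" is just an
illustrative helper name, not part of mdadm.

```shell
#!/bin/sh
# Print "<array> <member>" for every member flagged (S) in
# mdstat-format input read from stdin.
spares() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(S\)$/) {
                sub(/\[.*/, "", $i)   # strip "[n](S)" from the member name
                print $1, $i
            }
    }'
}

# Typical use on the affected box:
#   spares < /proc/mdstat
#   mdadm --detail /dev/md2    # per-member state, as mdadm sees it
```

A member listed here is not holding a live copy of the data; the array
is running on the remaining member only.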

> > I'd suggest stopping the array and using ddrescue to clone sda4 to 
> > sdb4. That'll copy everything possible, flagging up any read
> > issues. You'll then need to run a "fsck -f" on sdb4 to clear up
> > any filesystem damage. You may still be left with damaged/missing
> > files, depending on where any read errors occurred. How critical
> > this is will depend on what the filesystem is used for (and whether
> > you have any backup).
> 
> I am rather scared to do so ... as I am ~50kms away from the box now,
> and as it seems to be working fine so far (though there are currently
> no users working with it).
> 
It'll work fine unless something attempts to read from any of the
unreadable sectors on sda4. If these are not used by the filesystem
currently, then you may never run into an issue (as they'll get remapped
if a write error occurs when they do get used).

> As mentioned /dev/md2 doesn't contain a filesystem itself, but is the
> single PV in a LVM-volumegroup.
> 
> This group contains 6 logical volumes ...
> 
> As far as I understand it might be possible to spot the defective
> sectors and the related LV?
> 
A read of the relevant block device (dd if=/dev/xxx of=/dev/null) will
result in read errors for whichever block device contains the bad
sectors. You could also probably map the sectors reported by the kernel
to their position on the disk to tell which LV they fall in.
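
The mapping could be sketched roughly as below. This assumes all values
are in 512-byte sectors, that the kernel reports the sector relative to
the whole disk, and that you look up the partition start, md data offset
(0 for 0.90 metadata), pe_start and extent size yourself. The function
names pe_index and lv_for_pe are made up for illustration.

```shell
#!/bin/sh
# pe_index SECTOR PART_START MD_DATA_OFFSET PE_START EXTENT_SECTORS
# Convert a failing sector into an LVM physical extent index on the PV.
pe_index() {
    echo $(( ($1 - $2 - $3 - $4) / $5 ))
}

# lv_for_pe PE
# Reads lines of "pvseg_start pvseg_size lv_name" on stdin and prints the
# LV whose segment contains the given extent. Free segments have no LV
# name and are skipped.
lv_for_pe() {
    awk -v pe="$1" 'NF >= 3 && pe >= $1 && pe < $1 + $2 { print $3 }'
}

# Where the inputs could come from (check these on your own system):
#   cat /sys/block/sda/sda4/start               # partition start
#   pvs --units s -o pe_start /dev/md2          # pe_start
#   vgs --units s -o vg_extent_size             # extent size
#   pvs --noheadings --segments \
#       -o pvseg_start,pvseg_size,lv_name /dev/md2 | lv_for_pe "$pe"
```

If no LV name comes back, the sector lies in unallocated space on the PV.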

> I have backups, yes ...
> 
In which case the absolute safest option is just to recreate whatever
arrays, PVs, LVs, etc. you need on sdb4 and restore the data from
backup, ignoring whatever's on sda4 currently.

> > If that all works okay, then get sda replaced and give it a
> > thorough badblocks and SMART test.
> > 
> > I'd also advise setting up regular array checks (echo check > 
> > /sys/block/mdX/md/sync_action) to make sure the disks are checked 
> > and any unreadable blocks repaired/mapped out _before_ they're
> > needed for recovery.
> 
> re-adding sda4 and starting such a check would be possible?
> Or would a re-add damage things?
> 
You can't add sda4 because it's already in the array.

> Should I shutdown the box for safety?
> 
For absolute safety, yes, though I don't think the risk is too high at
the moment, and I don't think things'll get any worse in the short term.

> I am really feeling unsafe now, and getting another hdd for swapping
> will take me at least until monday.
> 
> (I would like to dd-rescue to another new disk to keep sdb, just in case)
> 
I doubt you'd be able to recover anything useful from sdb4 at the
moment, but that's up to you.
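
If you do clone sda4 to a fresh disk instead, a two-pass GNU ddrescue
run is the usual shape. This is only a sketch: /dev/sdc4 and the map
file name are placeholders, the array has to be stopped first, and the
helper prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# clone_member SOURCE TARGET MAPFILE
clone_member() {
    src=$1 dst=$2 map=$3
    # Pass 1: grab everything that reads cleanly, skipping the slow
    # scraping of bad areas.
    run ddrescue -f -n "$src" "$dst" "$map"
    # Pass 2: go back over the bad areas with direct access, retrying
    # up to 3 times. The map file lets ddrescue resume where it left off.
    run ddrescue -f -d -r3 "$src" "$dst" "$map"
}

# Example with placeholder names:
#   clone_member /dev/sda4 /dev/sdc4 sda4.map
```

Whatever is still listed as bad in the map file afterwards is data
that could not be read from sda4.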


-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |