Re: Help / advice RAID5 Inactive

On 02/08/18 20:11, jonathan.p.milton@xxxxxxxxx wrote:
> Hi Wol,
> 
> Thanks for your reply. I run BackupPC from a separate host and have good backups, so I'm not sweating too much!
> 
> I think you're right about drive 2 being bumped a while ago. That would make sense with the counts. My bad for having no error reporting enabled to alert me. Very disappointing though - these are Samsung drives and only 2 years old.
> 
It's supposedly being updated - you can configure xosview to display
raid status, although last I looked it wasn't working ... when it is, get
it to start automatically on your desktop :-) That way you'll see
instantly when there's a problem. I get twitchy if xosview isn't running
in the background telling me what's going on :-)
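
In the meantime, mdadm's own monitor mode will mail you when a drive
drops out - no desktop needed. Roughly (the address is obviously just an
example, and the config file may be /etc/mdadm/mdadm.conf on Debian-ish
systems):

   # in mdadm.conf:
   MAILADDR you@example.com

   # or run the monitor directly from a boot script:
   mdadm --monitor --scan --daemonise --delay 1800 --mail you@example.com

Many distros ship a service that runs the monitor for you once MAILADDR
is set.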

> Given I have backups I went for the --force option and am happy to report it all went smoothly.
> 
Good good good ...

> I am not seeing any evidence of rebuild, which is a surprise.
> |  # cat /proc/mdstat 
> |  Personalities : [raid6] [raid5] [raid4] 
> |  md0 : active raid5 sda1[0] sdd1[3] sdc1[1]
> |        3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2   [3/3] [UUU]
> |     bitmap: 0/15 pages [0KB], 65536KB chunk
> | unused devices: <none>

Did you force assemble all three drives? BAD idea!
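
For the record, the safer incantation is to force only the drives with
the up-to-date event counts and then add the stale one back, so md
rebuilds it from parity instead of trusting its contents. Assuming sdc1
was the stale drive (which is what your event counts suggested),
something like:

   mdadm --stop /dev/md0
   mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdd1
   mdadm /dev/md0 --add /dev/sdc1    # goes back in and gets rebuilt from the other two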
> 
> The raw device was encrypted. No problem with luksOpen
> 
> Now running xfs_repair on one of the logical volumes. Looks like I have some data loss but it is minor. Fortunately the server has been sitting idle for a couple of weeks due to vacation.
> 
> What do you think about there being no rebuild?

--force resets the event counts, so the raid wouldn't see any need for a
resync - which could easily explain your corruption. If xfs is fixing and
rewriting damaged files, that's great, but I wouldn't trust it - I
wouldn't trust the underlying raid to have recovered properly if all
three drives were forced. Anything xfs recovers, you should restore from
backup.
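
For future reference, --examine shows the per-device event counts and
update times, which is how you spot a stale member before forcing
anything (device names as per your earlier mail):

   mdadm --examine /dev/sda1 /dev/sdc1 /dev/sdd1 | egrep '/dev/|Events|Update Time'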

And then do a raid "repair". That'll fix all the parities, and clean up
the raid.
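
That's just a write to sysfs (md0 as in your mdstat above):

   echo repair > /sys/block/md0/md/sync_action
   cat /proc/mdstat                        # watch the progress
   cat /sys/block/md0/md/mismatch_cnt      # how many sectors disagreed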

Basically, assuming you forced all three drives, the raid code now
believes you have a clean array. Which means all the files written since
the first drive dropped out are suspect. Where the blocks in question sit
on the two up-to-date drives they'll be okay, because raid reads data
blocks directly and treats the parity as the thing to recalculate. But
where the data landed on the disk that dropped out, any file using that
space will now consist half of valid data and half of whatever stale
garbage is on the now-readmitted disk. The perfectly valid parity gets
ignored, giving you corrupt data :-( Raid-5 will happily rebuild a disk
that has gone awol; it can NOT detect or undo corruption in an array it
thinks is clean.

Assuming xfs and a restore from backup have recovered everything, that's
great for you, but I think you need to read the wiki. I get the
impression you don't fully understand how raid works, and I think you've
actually caused yourself a load of unnecessary work. Next time it could
cost you the array ... :-(

As for "why did the first drive get booted off?", you can get read
errors for pretty much any reason. Because you didn't have ERC
configured, the drive firmware could have barfed for any random reason,
and bang the drive gets kicked. Whoops! I'm guessing the array fell
over, because a second drive also barfed for some reason.
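
For anyone else reading along (you said you've already added this to
your boot scripts), the usual wiki incantation is along these lines -
sda is just an example device:

   smartctl -l scterc,70,70 /dev/sda    # 7 second error recovery, if the drive supports it
   # and if a drive does NOT support scterc, raise the kernel timeout instead:
   echo 180 > /sys/block/sda/device/timeout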

Oh - and if you go to raid-6, which you should, this will allow you to
recover from this sort of situation, but you need to use the right tools
- yet more reason to make sure you actually understand what's going on
"under the bonnet".

Cheers,
Wol
> 
> Cheers
> 
> Jonathan
> 
> 
> On Thu, 2018-08-02 at 19:54 +0100, Wols Lists wrote:
>> On 02/08/18 10:13, Jonathan Milton wrote:
>>> Hi,
>>>
>>> Overnight my server had problems with its RAID5 (xfs corrupt inodes), on
>>> reboot the raid comes up inactive.
>>>
>>> * Smarttools suggest disks (3x2TB) are healthy. I have powered down and
>>> checked all the SATA leads are still plugged correctly.
>>>
>>> * MDADM is unable to assemble the raid from 1 drive:
>>> # mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1
>>> mdadm: /dev/md0 assembled from 1 drive - not enough to start the array.
>>>
>>> * Event counts are well off on one drive ( 290391/182871/290391)
>>
>> Not good.
>>>
>>> * SCT Error Recovery Control was disabled on all drives prior to this
>>> failure but I have since modified the boot scripts to set to 7s as per
>>> the wiki (no improvement)
>>>
>>> I am considering whether to try --force and would like advice from
>>> experts first
>>>
>>
>> NOT WITHOUT A BACKUP!
>>>
>>> Thanks in advance
>>>
>>
>> That "only one drive" bothers me. Have you got any spare drives? Have
>> you any spare SATA ports to upgrade to raid-6?
>>
>> I'd ddrescue the two drives with the highest count (is that sda and
>> sdd?), then force assemble the copies. That stands a good chance of
>> succeeding. If that works, you can add back the third drive to recover
>> your raid-5 - keeping the original two as a temporary backup.
>>
>> If you can't get spare drives, overlay the two good drives then see if a
>> force gets you a working array. If it does, then you can try it without
>> the overlay, but not having a backup increases the risk ...
>>
>> Then add one of the original drives back to convert to raid-6.
>>
>> The event counts make me suspect the middle drive got booted long ago
>> for some reason, then you've had a hiccup that booted a second drive.
>> Quite likely if you didn't have ERC enabled. So it does look like an
>> easy fix but because you've effectively got a broken raid-0 at present,
>> the risk to your data from any further problem is HIGH. Read
>>
>> https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
>>
>>
>> If you don't have any spare SATA ports, go and buy something like
>>
>> https://www.amazon.co.uk/dp/B00952N2DQ/ref=twister_B01DUJJZ8U?_encoding=UTF8&th=1
>>
>> You want a card with one SATA *and* one eSATA port. Beware - I think
>> most of these have a jumper to switch between SATA *or* eSATA, so
>> you'll want a card that claims two of each. It will only actually
>> drive two SATA devices, so configure one port as SATA for your raid-6
>> and one as eSATA so you can temporarily add external disks ...
>>
>> https://www.amazon.co.uk/iDsonix-SuperSpeed-Docking-Station-Free-Black/dp/B00L3W0F40/ref=sr_1_1?ie=UTF8&qid=1529780418&sr=8-1&keywords=eSATA%2Bdisk%2Bdocking%2Bstation&th=1
>>
>> Not sure whether you can connect this with an eSATA port-multiplier
>> cable - do NOT run raid over the USB connection !!!
>>
>> Cheers,
>> Wol
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


