Looks like my third drive (/dev/sdd) has completely died. It's not even
showing up in /dev following a reboot. I'll have to investigate and see
whether this is just a connector issue or the drive really is dead
(rough plan below). Very disappointing if it is, given the POH is ~2.5
years and smartctl is reporting passes on all tests.

I have 4 SATA ports. I use three for the RAID and one for boot /
management stuff (which I'm glad of right now!). But it means I can't
go to RAID6 without the SATA expansion card you linked before.

This problem came about because I failed to notice the first drive
dropping out. I've been running OSSEC the last few months; I think that
would have detected the problem and emailed me.
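For the record, once I've reseated the cables, this is roughly what I
plan to run - the usual smartmontools checks, nothing clever, and
/dev/sdd is wherever the disk reappears:

# dmesg | grep -i -e sdd -e ata      (did the kernel enumerate it at all?)
# smartctl -H /dev/sdd               (overall SMART health verdict)
# smartctl -A /dev/sdd               (attribute table - POH is attribute 9)
# smartctl -t long /dev/sdd          (kick off a long self-test ...)
# smartctl -l selftest /dev/sdd      (... and read the results once done)
# smartctl -l scterc /dev/sdd        (confirm the boot script really did
                                      set ERC to 70 deciseconds, i.e. 7s)

If it won't even enumerate after a cable swap, I'll call it dead.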
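Belt and braces, I'm also going to have mdadm email me directly rather
than rely on OSSEC alone. As I read the man page it comes down to this
(the address is a placeholder, and some distros put the conf at
/etc/mdadm.conf instead):

# grep MAILADDR /etc/mdadm/mdadm.conf
MAILADDR jonathan@example.com
# mdadm --monitor --scan --daemonise --delay=300

Most distros ship a service that starts the monitor at boot. That would
have shouted about the DegradedArray event the day the first drive
dropped out.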
Cheers

On Thu, 2018-08-02 at 23:16 +0100, Wols Lists wrote:
> On 02/08/18 20:11, jonathan.p.milton@xxxxxxxxx wrote:
> > Hi Wol,
> >
> > Thanks for your reply. I run BackupPC from a separate host so have
> > good backups, so I'm not sweating too much!
> >
> > I think you're right about drive 2 being bumped a while ago. That
> > would make sense with the counts. My bad having no error reporting
> > enabled to alert me. Very disappointed though, these are Samsung
> > drives and only 2 years old.
>
> It's supposedly being updated - you can configure xosview to display
> raid status, although last I looked it wasn't working ... when it is,
> get it to start automatically on your desktop :-) That way, you'll see
> instantly there's a problem. I get twitchy if xosview isn't running in
> the background telling me what's going on :-)
>
> > Given I have backups I went for the --force option and am happy to
> > report it all went smoothly.
>
> Good good good ...
>
> > I am not seeing any evidence of a rebuild, which is a surprise.
> > # cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : active raid5 sda1[0] sdd1[3] sdc1[1]
> >       3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
> >       bitmap: 0/15 pages [0KB], 65536KB chunk
> > unused devices: <none>
>
> Did you force assemble all three drives? BAD idea!
>
> > The raw device was encrypted. No problem with luksOpen.
> >
> > Now running xfs_repair on one of the logical volumes. Looks like I
> > have some data loss but it is minor. Fortunately the server has been
> > sitting idle for a couple of weeks due to vacation.
> >
> > What do you think about there being no rebuild?
>
> --force resets the event count. So the raid wouldn't see any need for
> a repair. That could easily explain your corruption. If xfs is fixing
> and rewriting damaged files, that's great, but I wouldn't trust it - I
> wouldn't trust the underlying raid to recover properly if all three
> drives had been forced. Anything xfs recovers, you should restore from
> backup.
>
> And then do a raid "repair". That'll fix all the parities, and clean
> up the raid.
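(Noting this down for the archives: as I understand the wiki, that
"repair" is driven through sysfs - md0 below is my array, so adjust to
taste. I'll kick it off once the restore from backup finishes.)

# cat /sys/block/md0/md/mismatch_cnt     (how out-of-sync are the parities?)
# echo repair > /sys/block/md0/md/sync_action
# cat /proc/mdstat                       (watch the scrub progress)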
> Basically, assuming you forced all three drives, what happened is that
> the raid code assumed you now have a clean array. Which means that all
> the files written since the first drive dropped out are suspect -
> those where the two working drives were data drives will be okay,
> since raid preferentially assumes the parity is suspect, but if the
> raid tried to put the data on the disk that dropped out, any file
> using that space will now consist half of valid data and half of
> whatever garbage is on the now-restored disk. The valid parity will be
> ignored, giving you corrupt data :-( (RAID-5 does NOT allow you to
> successfully rebuild if your array is corrupted, it only allows you to
> rebuild a disk that has gone awol.)
>
> Assuming xfs and a restore from backup have recovered everything,
> that's great for you, but I think you need to read the wiki. I get the
> impression you don't fully understand how raid works, and I think
> you've actually caused yourself a load of unnecessary work. Next time
> it could cost you the array ... :-(
>
> As for "why did the first drive get booted off?", you can get read
> errors for pretty much any reason. Because you didn't have ERC
> configured, the drive firmware could have barfed for any random
> reason, and bang, the drive gets kicked. Whoops! I'm guessing the
> array fell over because a second drive also barfed for some reason.
>
> Oh - and if you go to raid-6, which you should, this will allow you to
> recover from this sort of situation, but you need to use the right
> tools - yet more reason to make sure you actually understand what's
> going on "under the bonnet".
>
> Cheers,
> Wol
>
> > Cheers
> >
> > Jonathan
> >
> > On Thu, 2018-08-02 at 19:54 +0100, Wols Lists wrote:
> > > On 02/08/18 10:13, Jonathan Milton wrote:
> > > > Hi,
> > > >
> > > > Overnight my server had problems with its RAID5 (xfs corrupt
> > > > inodes); on reboot the raid comes up inactive.
> > > >
> > > > * smartmontools suggests the disks (3x2TB) are healthy. I have
> > > >   powered down and checked all the SATA leads are still plugged
> > > >   in correctly.
> > > >
> > > > * mdadm is unable to assemble the raid, finding only 1 usable
> > > >   drive:
> > > > # mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1
> > > > mdadm: /dev/md0 assembled from 1 drive - not enough to start the array.
> > > >
> > > > * Event counts are well off on one drive (290391/182871/290391).
> > >
> > > Not good.
> > >
> > > > * SCT Error Recovery Control was disabled on all drives prior to
> > > >   this failure, but I have since modified the boot scripts to
> > > >   set it to 7s as per the wiki (no improvement).
> > > >
> > > > I am considering whether to try --force and would like advice
> > > > from experts first.
> > >
> > > NOT WITHOUT A BACKUP!
> > >
> > > > Thanks in advance
> > >
> > > That "only one drive" bothers me. Have you got any spare drives?
> > > Have you any spare SATA ports to upgrade to raid-6?
> > >
> > > I'd ddrescue the two drives with the highest count (is that sda
> > > and sdd?), then force assemble the copies. That stands a good
> > > chance of succeeding. If that works, you can add back the third
> > > drive to recover your raid-5 - keeping the original two as a
> > > temporary backup.
> > >
> > > If you can't get spare drives, overlay the two good drives then
> > > see if a force gets you a working array. If it does, then you can
> > > try it without the overlay, but not having a backup increases the
> > > risk ...
> > >
> > > Then add one of the original drives back to convert to raid-6.
> > >
> > > The event counts make me suspect the middle drive got booted long
> > > ago for some reason, then you've had a hiccup that booted a second
> > > drive. Quite likely if you didn't have ERC enabled. So it does
> > > look like an easy fix, but because you've effectively got a broken
> > > raid-0 at present, the risk to your data from any further problem
> > > is HIGH. Read
> > >
> > > https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
> > >
> > > If you don't have any spare SATA ports, go and buy something like
> > >
> > > https://www.amazon.co.uk/dp/B00952N2DQ/ref=twister_B01DUJJZ8U?_encoding=UTF8&th=1
> > >
> > > You want a card with one SATA *and* one eSATA - beware - I think
> > > most of these have a jumper to switch between SATA *or* eSATA, so
> > > you'll want a card that claims two of each - it will only actually
> > > drive two sata devices, so configure one port as SATA for your
> > > raid-6, and one as eSATA so you can temporarily add external
> > > disks ...
> > >
> > > https://www.amazon.co.uk/iDsonix-SuperSpeed-Docking-Station-Free-Black/dp/B00L3W0F40/ref=sr_1_1?ie=UTF8&qid=1529780418&sr=8-1&keywords=eSATA%2Bdisk%2Bdocking%2Bstation&th=1
> > >
> > > Not sure whether you can connect this with an eSATA
> > > port-multiplier cable - do NOT run raid over the USB connection !!!
> > >
> > > Cheers,
> > > Wol
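P.S. Once the expansion card arrives and I have a fourth disk in, I
take it the raid-5 to raid-6 conversion is roughly this - /dev/sde1 is
hypothetical, and the backup file has to live somewhere off the array
(my boot disk):

# mdadm /dev/md0 --add /dev/sde1
# mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=/root/md0-grow.backup

Happy to be corrected before I actually run it.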
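P.P.S. For anyone who finds this thread with the same symptoms: the
overlay trick Wol mentions is on the wiki, and as I understand it, it
comes down to a dm snapshot per member disk, so that the writes from a
forced assembly go to a scratch file instead of the real drive. Sizes
and paths here are illustrative only:

# truncate -s 2T /mnt/scratch/overlay-sda1    (sparse COW file - put it
                                               somewhere with real free
                                               space, not tmpfs)
# losetup -f --show /mnt/scratch/overlay-sda1
/dev/loop0
# dmsetup create sda1-overlay --table "0 $(blockdev --getsz /dev/sda1) snapshot /dev/sda1 /dev/loop0 P 8"
# mdadm --assemble --force /dev/md0 /dev/mapper/sda1-overlay /dev/mapper/sdc1-overlay ...

One overlay per drive, then assemble the overlays; only if the result
looks sane do you repeat it against the real devices.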