Looks like my third drive (/dev/sdd) has completely died. It's not even
showing up in /dev following a reboot. I'll have to investigate and see
whether this is just a connector issue or the drive really is dead
(rough plan below). Very disappointing if it is, given the POH is ~2.5
years and smartctl is reporting passes on all tests.

I have 4 SATA ports. I use three for the RAID and one for boot /
management stuff (which I'm glad of right now!). But it means I can't
go to RAID6 without the SATA expansion card you linked before.

This problem came about because I failed to notice the first drive
dropping out. I've been running OSSEC the last few months; I think that
would have detected the problem and emailed me.
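For the record, once I've reseated the cables, this is roughly what I
plan to run - the usual smartmontools checks, nothing clever, and
/dev/sdd is wherever the disk reappears:

# dmesg | grep -i -e sdd -e ata      (did the kernel enumerate it at all?)
# smartctl -H /dev/sdd               (overall SMART health verdict)
# smartctl -A /dev/sdd               (attribute table - POH is attribute 9)
# smartctl -t long /dev/sdd          (kick off a long self-test ...)
# smartctl -l selftest /dev/sdd      (... and read the results once done)
# smartctl -l scterc /dev/sdd        (confirm the boot script really did
                                      set ERC to 70 deciseconds, i.e. 7s)

If it won't even enumerate after a cable swap, I'll call it dead.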
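Belt and braces, I'm also going to have mdadm email me directly rather
than rely on OSSEC alone. As I read the man page it comes down to this
(the address is a placeholder, and some distros put the conf at
/etc/mdadm.conf instead):

# grep MAILADDR /etc/mdadm/mdadm.conf
MAILADDR jonathan@example.com
# mdadm --monitor --scan --daemonise --delay=300

Most distros ship a service that starts the monitor at boot. That would
have shouted about the DegradedArray event the day the first drive
dropped out.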
Cheers

On Thu, 2018-08-02 at 23:16 +0100, Wols Lists wrote:
> On 02/08/18 20:11, jonathan.p.milton@xxxxxxxxx wrote:
> > Hi Wol,
> >
> > Thanks for your reply. I run BackupPC from a separate host so have
> > good backups, so I'm not sweating too much!
> >
> > I think you're right about drive 2 being bumped a while ago. That
> > would make sense with the counts. My bad having no error reporting
> > enabled to alert me. Very disappointed though, these are Samsung
> > drives and only 2 years old.
>
> It's supposedly being updated - you can configure xosview to display
> raid status, although last I looked it wasn't working ... when it is,
> get it to start automatically on your desktop :-) That way, you'll see
> instantly there's a problem. I get twitchy if xosview isn't running in
> the background telling me what's going on :-)
>
> > Given I have backups I went for the --force option and am happy to
> > report it all went smoothly.
>
> Good good good ...
>
> > I am not seeing any evidence of a rebuild, which is a surprise.
> > # cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : active raid5 sda1[0] sdd1[3] sdc1[1]
> >       3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
> >       bitmap: 0/15 pages [0KB], 65536KB chunk
> > unused devices: <none>
>
> Did you force assemble all three drives? BAD idea!
>
> > The raw device was encrypted. No problem with luksOpen.
> >
> > Now running xfs_repair on one of the logical volumes. Looks like I
> > have some data loss but it is minor. Fortunately the server has been
> > sitting idle for a couple of weeks due to vacation.
> >
> > What do you think about there being no rebuild?
>
> --force resets the event count. So the raid wouldn't see any need for
> a repair. That could easily explain your corruption. If xfs is fixing
> and rewriting damaged files, that's great, but I wouldn't trust it - I
> wouldn't trust the underlying raid to recover properly if all three
> drives had been forced. Anything xfs recovers, you should restore from
> backup.
>
> And then do a raid "repair". That'll fix all the parities, and clean
> up the raid.
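(Noting this down for the archives: as I understand the wiki, that
"repair" is driven through sysfs - md0 below is my array, so adjust to
taste. I'll kick it off once the restore from backup finishes.)

# cat /sys/block/md0/md/mismatch_cnt     (how out-of-sync are the parities?)
# echo repair > /sys/block/md0/md/sync_action
# cat /proc/mdstat                       (watch the scrub progress)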
> Basically, assuming you forced all three drives, what happened is that
> the raid code assumed you now have a clean array. Which means that all
> the files written since the first drive dropped out are suspect -
> those where the two working drives were data drives will be okay,
> since raid preferentially assumes the parity is suspect, but if the
> raid tried to put the data on the disk that dropped out, any file
> using that space will now consist half of valid data and half of
> whatever garbage is on the now-restored disk. The valid parity will be
> ignored, giving you corrupt data :-( (RAID-5 does NOT allow you to
> successfully rebuild if your array is corrupted, it only allows you to
> rebuild a disk that has gone awol.)
>
> Assuming xfs and a restore from backup have recovered everything,
> that's great for you, but I think you need to read the wiki. I get the
> impression you don't fully understand how raid works, and I think
> you've actually caused yourself a load of unnecessary work. Next time
> it could cost you the array ... :-(
>
> As for "why did the first drive get booted off?", you can get read
> errors for pretty much any reason. Because you didn't have ERC
> configured, the drive firmware could have barfed for any random
> reason, and bang, the drive gets kicked. Whoops! I'm guessing the
> array fell over because a second drive also barfed for some reason.
>
> Oh - and if you go to raid-6, which you should, this will allow you to
> recover from this sort of situation, but you need to use the right
> tools - yet more reason to make sure you actually understand what's
> going on "under the bonnet".
>
> Cheers,
> Wol
>
> > Cheers
> >
> > Jonathan
> >
> > On Thu, 2018-08-02 at 19:54 +0100, Wols Lists wrote:
> > > On 02/08/18 10:13, Jonathan Milton wrote:
> > > > Hi,
> > > >
> > > > Overnight my server had problems with its RAID5 (xfs corrupt
> > > > inodes); on reboot the raid comes up inactive.
> > > >
> > > > * smartmontools suggests the disks (3x2TB) are healthy. I have
> > > >   powered down and checked all the SATA leads are still plugged
> > > >   in correctly.
> > > >
> > > > * mdadm is unable to assemble the raid, finding only 1 usable
> > > >   drive:
> > > > # mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1
> > > > mdadm: /dev/md0 assembled from 1 drive - not enough to start the array.
> > > >
> > > > * Event counts are well off on one drive (290391/182871/290391).
> > >
> > > Not good.
> > >
> > > > * SCT Error Recovery Control was disabled on all drives prior to
> > > >   this failure, but I have since modified the boot scripts to
> > > >   set it to 7s as per the wiki (no improvement).
> > > >
> > > > I am considering whether to try --force and would like advice
> > > > from experts first.
> > >
> > > NOT WITHOUT A BACKUP!
> > >
> > > > Thanks in advance
> > >
> > > That "only one drive" bothers me. Have you got any spare drives?
> > > Have you any spare SATA ports to upgrade to raid-6?
> > >
> > > I'd ddrescue the two drives with the highest count (is that sda
> > > and sdd?), then force assemble the copies. That stands a good
> > > chance of succeeding. If that works, you can add back the third
> > > drive to recover your raid-5 - keeping the original two as a
> > > temporary backup.
> > >
> > > If you can't get spare drives, overlay the two good drives then
> > > see if a force gets you a working array. If it does, then you can
> > > try it without the overlay, but not having a backup increases the
> > > risk ...
> > >
> > > Then add one of the original drives back to convert to raid-6.
> > >
> > > The event counts make me suspect the middle drive got booted long
> > > ago for some reason, then you've had a hiccup that booted a second
> > > drive. Quite likely if you didn't have ERC enabled. So it does
> > > look like an easy fix, but because you've effectively got a broken
> > > raid-0 at present, the risk to your data from any further problem
> > > is HIGH. Read
> > >
> > > https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
> > >
> > > If you don't have any spare SATA ports, go and buy something like
> > >
> > > https://www.amazon.co.uk/dp/B00952N2DQ/ref=twister_B01DUJJZ8U?_encoding=UTF8&th=1
> > >
> > > You want a card with one SATA *and* one eSATA - beware - I think
> > > most of these have a jumper to switch between SATA *or* eSATA, so
> > > you'll want a card that claims two of each - it will only actually
> > > drive two sata devices, so configure one port as SATA for your
> > > raid-6, and one as eSATA so you can temporarily add external
> > > disks ...
> > >
> > > https://www.amazon.co.uk/iDsonix-SuperSpeed-Docking-Station-Free-Black/dp/B00L3W0F40/ref=sr_1_1?ie=UTF8&qid=1529780418&sr=8-1&keywords=eSATA%2Bdisk%2Bdocking%2Bstation&th=1
> > >
> > > Not sure whether you can connect this with an eSATA
> > > port-multiplier cable - do NOT run raid over the USB connection !!!
> > >
> > > Cheers,
> > > Wol
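P.S. Once the expansion card arrives and I have a fourth disk in, I
take it the raid-5 to raid-6 conversion is roughly this - /dev/sde1 is
hypothetical, and the backup file has to live somewhere off the array
(my boot disk):

# mdadm /dev/md0 --add /dev/sde1
# mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=/root/md0-grow.backup

Happy to be corrected before I actually run it.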
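P.P.S. For anyone who finds this thread with the same symptoms: the
overlay trick Wol mentions is on the wiki, and as I understand it, it
comes down to a dm snapshot per member disk, so that the writes from a
forced assembly go to a scratch file instead of the real drive. Sizes
and paths here are illustrative only:

# truncate -s 2T /mnt/scratch/overlay-sda1    (sparse COW file - put it
                                               somewhere with real free
                                               space, not tmpfs)
# losetup -f --show /mnt/scratch/overlay-sda1
/dev/loop0
# dmsetup create sda1-overlay --table "0 $(blockdev --getsz /dev/sda1) snapshot /dev/sda1 /dev/loop0 P 8"
# mdadm --assemble --force /dev/md0 /dev/mapper/sda1-overlay /dev/mapper/sdc1-overlay ...

One overlay per drive, then assemble the overlays; only if the result
looks sane do you repeat it against the real devices.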