Hi,

Could you check Current_Pending_Sector and Reallocated_Sector_Ct for the drives in the array? You'll find them in the output of smartctl -a /dev/sdX. They should be zero, but a few errors won't sink the ship. Also, check whether there is a populated bad block list on any of the drives (I've put a couple of example commands for both checks after the quoted message below). I've written a bit about these here:
https://wiki.karlsbakk.net/index.php?title=Roy%27s_notes#The_badblock_list
There's also https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy for more info.

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Carve the good in stone, write the bad in snow.

----- Original Message -----
> From: esqychd2f5@xxxxxxxxxxxxxx
> To: "Linux Raid" <linux-raid@xxxxxxxxxxxxxxx>
> Sent: Saturday, 16 July, 2022 23:17:25
> Subject: Determining cause of md RAID 'recovery interrupted'
> Hi,
>
> I'm a long-time md raid user and a big fan of the project. I have run into an
> issue that I haven't been able to track down a solution to online.
>
> I have an md raid array using 12TB Seagate IronWolf NAS drives in a RAID6
> configuration. This array grew from 4 drives to 10 drives over several years,
> and after the restripe to 10 drives it started occasionally dropping drives
> without obvious errors (no read or write issues).
>
> The server is running Ubuntu 20.04.4 LTS (fully updated) and the drives are
> connected using LSI SAS 9207-8i adapters.
>
> The dropping of drives has left the array in a degraded state, and I
> can't get it to rebuild. It fails with a 'recovery interrupted' message. It
> did rebuild successfully a few times, but now fails consistently at the same
> point, around 12% done.
>
> I have confirmed that I can read all data from all of my drives using the
> 'badblocks' tool. No read errors are reported.
>
> The rebuild, from start to failure, looks like this in the system log:
> [ 715.210403] md: md3 stopped.
> [ 715.447441] md/raid:md3: device sdd operational as raid disk 1
> [ 715.447443] md/raid:md3: device sdp operational as raid disk 9
> [ 715.447444] md/raid:md3: device sdc operational as raid disk 7
> [ 715.447445] md/raid:md3: device sdb operational as raid disk 6
> [ 715.447446] md/raid:md3: device sdm operational as raid disk 5
> [ 715.447447] md/raid:md3: device sdn operational as raid disk 4
> [ 715.447448] md/raid:md3: device sdq operational as raid disk 3
> [ 715.447449] md/raid:md3: device sdo operational as raid disk 2
> [ 715.451780] md/raid:md3: raid level 6 active with 8 out of 10 devices,
> algorithm 2
> [ 715.451839] md3: detected capacity change from 0 to 96000035258368
> [ 715.452035] md: recovery of RAID array md3
> [ 715.674492] md3: p1
> [ 9803.487218] md: md3: recovery interrupted.
>
> I have the technical data about the drive, but it is very large (181K), so I'll
> post it as a reply to this message to minimize clutter.
> There are a few md RAID arrays shown in the logs; the one with the problem is
> 'md3'.
>
> Initially, I'd like to figure out why the rebuild gets interrupted (later I will
> look into why drives are being dropped). I would expect an error message
> explaining the interruption, but I haven't been able to find it. Maybe it is
> in an unexpected system log file?
>
> One thing I notice is that one of my drives (/dev/sdc) has 'Bad Blocks Present':
> Bad Block Log : 512 entries available at offset 264 sectors - bad blocks
> present.
>
> So, a few questions:
>
> - Would the 'Bad Blocks Present' for sdc lead to 'recovery interrupted'?
> - More generally, how do I find out what has interrupted the rebuild?
>
> Thanks in advance for your help!
>
> Joe
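
P.S. Example commands for the checks above. To pull the two SMART attributes from every member in one go, a small loop like this should work (just a sketch; run as root and adjust the device list to match your array):

  for d in /dev/sd[b-q]; do
      echo "=== $d ==="
      # Print only the two attributes of interest from the SMART attribute table
      smartctl -A "$d" | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'
  done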
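
And to see whether the md bad block list is populated on a member, mdadm can dump it from the superblock; I believe the kernel also exposes it through sysfs while the array is assembled (device names here are only examples):

  # Bad block list recorded in the member's superblock (1.x metadata)
  mdadm --examine-badblocks /dev/sdc

  # Same list via sysfs for the assembled array
  cat /sys/block/md3/md/dev-sdc/bad_blocks

A populated list here is what the 'bad blocks present' line in your --examine output points at, and it's worth reading the two links above before doing anything about it.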