Hi,

Could you check Current_Pending_Sector and Reallocated_Sector_Ct for the drives in the array? You'll find them in the output of smartctl -a /dev/sdX. They should be zero, but a few errors won't sink the ship. Also, check whether there is a populated bad block list on any of the drives (I've put a couple of example commands for both checks after the quoted message below). I've written a bit about these here:
https://wiki.karlsbakk.net/index.php?title=Roy%27s_notes#The_badblock_list
There's also https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy for more info.

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Carve the good in stone, write the bad in snow.

----- Original Message -----
> From: esqychd2f5@xxxxxxxxxxxxxx
> To: "Linux Raid" <linux-raid@xxxxxxxxxxxxxxx>
> Sent: Saturday, 16 July, 2022 23:17:25
> Subject: Determining cause of md RAID 'recovery interrupted'
> Hi,
>
> I'm a long-time md raid user and a big fan of the project. I have run into an
> issue that I haven't been able to track down a solution to online.
>
> I have an md raid array using 12TB Seagate IronWolf NAS drives in a RAID6
> configuration. This array grew from 4 drives to 10 drives over several years,
> and after the restripe to 10 drives it started occasionally dropping drives
> without obvious errors (no read or write issues).
>
> The server is running Ubuntu 20.04.4 LTS (fully updated) and the drives are
> connected using LSI SAS 9207-8i adapters.
>
> The dropping of drives has left the array in a degraded state, and I
> can't get it to rebuild. It fails with a 'recovery interrupted' message. It
> did rebuild successfully a few times, but now fails consistently at the same
> point, around 12% done.
>
> I have confirmed that I can read all data from all of my drives using the
> 'badblocks' tool. No read errors are reported.
>
> The rebuild, from start to failure, looks like this in the system log:
> [ 715.210403] md: md3 stopped.
> [ 715.447441] md/raid:md3: device sdd operational as raid disk 1
> [ 715.447443] md/raid:md3: device sdp operational as raid disk 9
> [ 715.447444] md/raid:md3: device sdc operational as raid disk 7
> [ 715.447445] md/raid:md3: device sdb operational as raid disk 6
> [ 715.447446] md/raid:md3: device sdm operational as raid disk 5
> [ 715.447447] md/raid:md3: device sdn operational as raid disk 4
> [ 715.447448] md/raid:md3: device sdq operational as raid disk 3
> [ 715.447449] md/raid:md3: device sdo operational as raid disk 2
> [ 715.451780] md/raid:md3: raid level 6 active with 8 out of 10 devices,
> algorithm 2
> [ 715.451839] md3: detected capacity change from 0 to 96000035258368
> [ 715.452035] md: recovery of RAID array md3
> [ 715.674492] md3: p1
> [ 9803.487218] md: md3: recovery interrupted.
>
> I have the technical data about the drive, but it is very large (181K), so I'll
> post it as a reply to this message to minimize clutter.
> There are a few md RAID arrays shown in the logs; the one with the problem is
> 'md3'.
>
> Initially, I'd like to figure out why the rebuild gets interrupted (later I will
> look into why drives are being dropped). I would expect an error message
> explaining the interruption, but I haven't been able to find it. Maybe it is
> in an unexpected system log file?
>
> One thing I notice is that one of my drives (/dev/sdc) has 'Bad Blocks Present':
> Bad Block Log : 512 entries available at offset 264 sectors - bad blocks
> present.
>
> So, a few questions:
>
> - Would the 'Bad Blocks Present' for sdc lead to 'recovery interrupted'?
> - More generally, how do I find out what has interrupted the rebuild?
>
> Thanks in advance for your help!
>
> Joe
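
P.S. Example commands for the checks above. To pull the two SMART attributes from every member in one go, a small loop like this should work (just a sketch; run as root and adjust the device list to match your array):

  for d in /dev/sd[b-q]; do
      echo "=== $d ==="
      # Print only the two attributes of interest from the SMART attribute table
      smartctl -A "$d" | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'
  done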
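
And to see whether the md bad block list is populated on a member, mdadm can dump it from the superblock; I believe the kernel also exposes it through sysfs while the array is assembled (device names here are only examples):

  # Bad block list recorded in the member's superblock (1.x metadata)
  mdadm --examine-badblocks /dev/sdc

  # Same list via sysfs for the assembled array
  cat /sys/block/md3/md/dev-sdc/bad_blocks

A populated list here is what the 'bad blocks present' line in your --examine output points at, and it's worth reading the two links above before doing anything about it.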