Determining cause of md RAID 'recovery interrupted'

Hi,

I'm a long-time md RAID user, and a big fan of the project.  I have run into an issue for which I haven't been able to find a solution online.

I have an md RAID array of 12TB Seagate IronWolf NAS drives in a RAID6 configuration.  The array grew from 4 drives to 10 drives over several years, and after the reshape to 10 drives it started occasionally dropping drives without obvious errors (no read or write issues).

The server is running Ubuntu 20.04.4 LTS (fully updated) and the drives are connected using LSI SAS 9207-8i adapters.

The dropped drives have left the array in a degraded state, and I can't get it to rebuild: it fails with a 'recovery interrupted' message.  It did rebuild successfully a few times, but it now fails consistently at the same point, around 12% done.
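
Recovery progress for md3 can be watched from /proc/mdstat; roughly like this (just a generic sketch, not a capture from my machine):

  # one-off look at rebuild state and progress
  cat /proc/mdstat
  # or keep an eye on it as it runs
  watch -n 30 cat /proc/mdstat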

I have confirmed that I can read all data from all of my drives using the 'badblocks' tool; no read errors are reported.
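
The read test was along these lines for each member drive (sdX is a placeholder; this is badblocks' default non-destructive read-only mode, and -b 4096 keeps the block count within badblocks' limits on a 12TB disk):

  # read-only surface scan of one member drive (non-destructive)
  # sdX is a placeholder; -b 4096 avoids the block-count overflow that the
  # default 1024-byte block size hits on drives this large
  badblocks -b 4096 -sv /dev/sdX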

The rebuild, from start to failure, looks like this in the system log:
[  715.210403] md: md3 stopped.
[  715.447441] md/raid:md3: device sdd operational as raid disk 1
[  715.447443] md/raid:md3: device sdp operational as raid disk 9
[  715.447444] md/raid:md3: device sdc operational as raid disk 7
[  715.447445] md/raid:md3: device sdb operational as raid disk 6
[  715.447446] md/raid:md3: device sdm operational as raid disk 5
[  715.447447] md/raid:md3: device sdn operational as raid disk 4
[  715.447448] md/raid:md3: device sdq operational as raid disk 3
[  715.447449] md/raid:md3: device sdo operational as raid disk 2
[  715.451780] md/raid:md3: raid level 6 active with 8 out of 10 devices, algorithm 2
[  715.451839] md3: detected capacity change from 0 to 96000035258368
[  715.452035] md: recovery of RAID array md3
[  715.674492]  md3: p1
[ 9803.487218] md: md3: recovery interrupted.

I have the technical data for the drives, but it is very large (181K), so I'll post it as a reply to this message to minimize clutter.
There are a few md RAID arrays shown in the logs; the one with the problem is 'md3'.

Initially, I'd like to figure out why the rebuild gets interrupted (later I will look into why drives are being dropped).  I would expect an error message explaining the interruption, but I haven't been able to find it.  Maybe it is in an unexpected system log file?
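
The kind of search I've been running over the logs looks roughly like this, without turning up anything beyond the lines quoted above (a sketch; log locations are the standard Ubuntu ones):

  # kernel messages from the current boot
  journalctl -k -b | grep -iE 'md3|md:|recovery'
  # the traditional log files on Ubuntu 20.04
  grep -iE 'md3|recovery' /var/log/kern.log /var/log/syslog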

One thing I notice is that one of my drives (/dev/sdc) has 'Bad Blocks Present':
  Bad Block Log : 512 entries available at offset 264 sectors - bad blocks present.
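
If it helps, the recorded entries can be listed directly with mdadm (this applies to 1.x metadata; sdc is the member device quoted above):

  # dump the sectors recorded in sdc's md bad block log (1.x metadata only)
  mdadm --examine-badblocks /dev/sdc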

So, a few questions:

- Would the 'Bad Blocks Present' for sdc lead to 'recovery interrupted'?
- More generally, how do I find out what has interrupted the rebuild?  (A sketch of the md state I know how to inspect is below.)
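
For completeness, the md state that can be inspected from sysfs and mdadm looks like this (md3 as above; just a sketch of where to look, not output from my machine):

  # current and most recent sync operation, and how far the current one got
  cat /sys/block/md3/md/sync_action
  cat /sys/block/md3/md/last_sync_action
  cat /sys/block/md3/md/sync_completed
  # whether md still considers the array degraded
  cat /sys/block/md3/md/degraded
  # high-level view of the array and its members
  mdadm --detail /dev/md3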

Thanks in advance for your help!

Joe


