https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
You've obviously covered some of this - glitches with Ubuntu and an old
mdadm are common, but this clearly isn't that ...
Please give us the information that page asks for - it probably won't
help much, but some things sometimes jump out ...
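For reference, collecting what that page asks for usually amounts to
something like this (sdX / sdX1 stand in for each drive and member
partition):

  uname -a                     # kernel version
  mdadm --version
  cat /proc/mdstat             # current array and reshape state
  mdadm --detail /dev/md0
  mdadm --examine /dev/sdX1    # repeat for every member partition
  smartctl -x /dev/sdX         # repeat for every drive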
Cheers,
Wol
On 06/03/2021 11:36, apfc123@xxxxxxxxx wrote:
Hello,
The reshape is currently stalled at 58%. If I reboot the system, the array
is started in auto-read-only mode with resync pending. I can freeze
the sync and mount the filesystem read-only to access the data.
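Roughly what I do to get at the data after a reboot (the mount point is
just an example):

  echo frozen > /sys/block/md0/md/sync_action   # keep md from starting the pending resync/reshape
  mount -o ro /dev/md0 /mnt                     # read-only mount of the filesystem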
When it first stalled and I rebooted, I ran extended SMART tests on all
18 drives and one came back with a read error. I ddrescued it to a new
drive (99.99% rescued) and swapped it out, but the reshape still stalls
immediately when I set the array to --readwrite.
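For completeness, the clone and the resume attempt were along these lines
(device names and the map file path are illustrative, not the exact ones
I used):

  smartctl -t long /dev/sdX                            # extended self-test, run on all 18 drives
  smartctl -a /dev/sdX                                 # one drive reported a read error here
  ddrescue -f /dev/sdBAD /dev/sdNEW /root/rescue.map   # clone the failing drive, 99.99% rescued
  mdadm --readwrite /dev/md0                           # clear auto-read-only; the reshape stalls right after this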
The raid5 array started out with 12 drives. I added 6 more drives, then
grew it to 18 drives and raid6 at the same time. The migration was
started with mdadm 4.1 on kernel 4.19. I've since tried booting Debian
testing with kernel 5.10 (I didn't check the mdadm version) and also
Arch Linux with kernel 5.11, but with the same results.
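For reference, a migration of that shape is normally kicked off with
something like the following (not necessarily my exact invocation;
sdX1 / sdY1 ... stand for the six new member partitions):

  mdadm --add /dev/md0 /dev/sdX1 /dev/sdY1 ...          # add the six new drives as spares
  mdadm --grow /dev/md0 --level=6 --raid-devices=18     # raid5 -> raid6 and 12 -> 18 devices in one reshape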
/dev/md0:
           Version : 1.2
     Creation Time : Tue Dec 17 01:51:38 2019
        Raid Level : raid6
        Array Size : 42975736320 (40984.86 GiB 44007.15 GB)
     Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
      Raid Devices : 18
     Total Devices : 18
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Mar 6 09:41:58 2021
             State : clean, degraded, resyncing (PENDING)
    Active Devices : 17
   Working Devices : 18
    Failed Devices : 0
     Spare Devices : 1

            Layout : left-symmetric-6
        Chunk Size : 512K

Consistency Policy : bitmap

     Delta Devices : 5, (13->18)
        New Layout : left-symmetric

              Name : debian:0 (local to host debian)
              UUID : 2a0d5568:ea53b429:30df79c9:e7559668
            Events : 303246

    Number   Major   Minor   RaidDevice State
       0       8      129        0      active sync   /dev/sdi1
       1       8       49        1      active sync   /dev/sdd1
       2       8      209        2      active sync   /dev/sdn1
       3       8        1        3      active sync   /dev/sda1
       4       8       65        4      active sync   /dev/sde1
       5       8      145        5      active sync   /dev/sdj1
       6      65       17        6      active sync   /dev/sdr1
       7       8      113        7      active sync   /dev/sdh1
       8      65        1        8      active sync   /dev/sdq1
       9       8       81        9      active sync   /dev/sdf1
      10       8       17       10      active sync   /dev/sdb1
      12       8      193       11      active sync   /dev/sdm1
      18       8      241       12      spare rebuilding   /dev/sdp1
      17       8      225       13      active sync   /dev/sdo1
      16       8      177       14      active sync   /dev/sdl1
      15       8      161       15      active sync   /dev/sdk1
      14       8       97       16      active sync   /dev/sdg1
      13       8       33       17      active sync   /dev/sdc1
One of the entries from dmesg the first time reshaping stalled:
[105003.994653] INFO: task md0_reshape:3296 blocked for more than 120 seconds.
[105003.994916] Tainted: G I 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
[105003.995169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[105003.995434] md0_reshape D 0 3296 2 0x80000000
[105003.995436] Call Trace:
[105003.995441] ? __schedule+0x2a2/0x870
[105003.995442] schedule+0x28/0x80
[105003.995448] reshape_request+0x862/0x940 [raid456]
[105003.995451] ? finish_wait+0x80/0x80
[105003.995454] raid5_sync_request+0x34a/0x3b0 [raid456]
[105003.995460] md_do_sync.cold.86+0x3f4/0x911 [md_mod]
[105003.995461] ? finish_wait+0x80/0x80
[105003.995464] ? __switch_to_asm+0x35/0x70
[105003.995467] ? md_rdev_init+0xb0/0xb0 [md_mod]
[105003.995471] md_thread+0x94/0x150 [md_mod]
[105003.995473] kthread+0x112/0x130
[105003.995475] ? kthread_bind+0x30/0x30
[105003.995476] ret_from_fork+0x35/0x40
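While it's wedged like that, the md state I know how to look at is
/proc/mdstat plus the sysfs files for md0, e.g.:

  cat /proc/mdstat
  cat /sys/block/md0/md/sync_action        # shows reshape / frozen / idle
  cat /sys/block/md0/md/sync_completed     # sectors completed out of the total
  cat /sys/block/md0/md/reshape_position   # where the reshape has got to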
I'm not sure where to continue troubleshooting. I've been moving drives
around in the storage appliance in case there are bad ports on the
backplane, but it doesn't seem to make any difference. I only have one
4-port HBA right now, so I can't even try another. The storage appliance
has two controllers and I've already tried swapping them as well as the
SAS cables. The only thing left I can try hardware-wise is to install
the interposers that go between the drives and the backplane, but I'm
not very hopeful.
Any help would be greatly appreciated. I don't have the extra space
right now to copy everything out and create a new array, so I'm really
hoping to get past this stalling issue. Thanks.