Re: Can't replace drive in imsm RAID 5 array, spare not shown

On 9 October 2024 11:09:40 BST, Mariusz Tkaczyk <mariusz.tkaczyk@xxxxxxxxxxxxxxx> wrote:
>On Sun, 06 Oct 2024 07:00:18 +0100
>19 Devices <19devices@xxxxxxxxx> wrote:
>
>> Hi, I have a 4-drive imsm RAID 5 array which is working fine.  I want to
>> remove one of the drives, sda, and replace it with a spare, sdc.  From man
>> mdadm I understand that add, then fail, then remove is the way to go, but
>> this does not work.
>> 
>> Before:
>> $ cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md124 : active raid5 sdd[3] sdb[2] sda[1] sde[0]
>>       2831155200 blocks super external:/md126/0 level 5, 128k chunk, algorithm 0 [4/4] [UUUU]
>> 
>> md125 : active raid5 sdd[3] sdb[2] sda[1] sde[0]
>>       99116032 blocks super external:/md126/1 level 5, 128k chunk, algorithm 0 [4/4] [UUUU]
>> 
>> md126 : inactive sda[3](S) sdb[2](S) sdd[1](S) sde[0](S)
>>       14681 blocks super external:imsm
>> 
>> unused devices: <none>
>> 
>> 
>> I can add (or add-spare) the drive, which increases the size of the container,
>> and although I can't see any spare drives listed by mdadm, it appears as a
>> SPARE DISK in the Intel option ROM after a reboot.
>> 
>> $ sudo mdadm --zero-superblock /dev/sdc
>> 
>> $ sudo mdadm /dev/md/imsm1 --add-spare /dev/sdc
>> mdadm: added /dev/sdc
>> 
>> $ cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md124 : active raid5 sdd[3] sdb[2] sda[1] sde[0]
>>       2831155200 blocks super external:/md126/0 level 5, 128k chunk, algorithm 0 [4/4] [UUUU]
>> 
>> md125 : active raid5 sdd[3] sdb[2] sda[1] sde[0]
>>       99116032 blocks super external:/md126/1 level 5, 128k chunk, algorithm 0 [4/4] [UUUU]
>> 
>> md126 : inactive sdc[4](S) sda[3](S) sdb[2](S) sdd[1](S) sde[0](S)
>>       15786 blocks super external:imsm
>> 
>> unused devices: <none>
>> $
>> 
>> 
>> No spare devices listed here:
>> 
>> $ sudo mdadm -D /dev/md/imsm1
>> /dev/md/imsm1:
>>            Version : imsm
>>         Raid Level : container
>>      Total Devices : 5
>> 
>>    Working Devices : 5
>> 
>> 
>>               UUID : bdb7f495:21b8c189:e496c216:6f2d6c4c
>>      Member Arrays : /dev/md/md1_0 /dev/md/md0_0
>> 
>>     Number   Major   Minor   RaidDevice
>> 
>>        -       8       64        -        /dev/sde
>>        -       8       32        -        /dev/sdc
>>        -       8        0        -        /dev/sda
>>        -       8       48        -        /dev/sdd
>>        -       8       16        -        /dev/sdb
>> $
>> 
>Hello,
>
>I know. It is fine; from the container's point of view, all of these are spares.
>Nobody has ever complained about that, so we never fixed it :)
>The most important thing is that all the drives are there.
>
>To detect spares, compare this list with the list from #mdadm --detail
>/dev/md124 (a member array). Drives that are not used in any member array are
>spares.
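>
>For example, a minimal sketch (run as root, and assuming /dev/md124 is one of
>your member arrays; both member arrays use the same drives here) that prints
>the container drives not used by the member array, i.e. the spares:
>
>comm -23 \
>  <(mdadm --detail /dev/md/imsm1 | grep -o '/dev/sd[a-z]*' | sort -u) \
>  <(mdadm --detail /dev/md124    | grep -o '/dev/sd[a-z]*' | sort -u)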
>> 
>> Trying to remove sda fails.
>> 
>> $ sudo mdadm --fail /dev/md126 /dev/sda
>> mdadm: Cannot remove /dev/sda from /dev/md126, array will be failed.
>
>It might be an issue in mdadm; we added this check and later added fixes:
>
>Commit:
>https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=fc6fd4063769f4194c3fb8f77b32b2819e140fb9
>
>Fixes:
>https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=b3e7b7eb1dfedd7cbd9a3800e884941f67d94c96
>https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=461fae7e7809670d286cc19aac5bfa861c29f93a
>
>but your release is mdadm-4.3, so all of the fixes should already be there. It
>might be a new bug.
>
>Try:
>#mdadm -If sda
>but please do not abuse it (use it only once, because it may fail your
>array). According to mdstat it should be safe in this case.
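>
>A quick sanity check before forcing it (a sketch using the standard md sysfs
>attribute) is to confirm that neither member array is already degraded:
>
>#cat /sys/block/md124/md/degraded /sys/block/md125/md/degraded
>
>Both values should be 0 before failing a drive out of a healthy RAID 5.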
>
>If you can do some investigation, I would be thankful; I expect the issue is
>in the enough() function.
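>
>One possible way to investigate (a rough sketch, not a tested recipe) is to
>build mdadm from git and break on enough() to see why the removal is refused:
>
>#git clone https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>#cd mdadm && make
>#gdb --args ./mdadm --fail /dev/md126 /dev/sda
>(gdb) break enough
>(gdb) run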
>
>Thanks,
>Mariusz
>
>> 
>> sda is 2TB, the others are 1TB - is that a problem?
>> 
>> smartctl shows that 2 drives don't support SCT and that it's disabled on the other 3.
>> 
>> There's a very similar question here from Edwin in 2017:
>> https://unix.stackexchange.com/questions/372908/add-hot-spare-drive-to-intel-rst-onboard-raid#372920
>> 
>> The only reply points to an Intel doc which uses the standard command to add
>> a drive but doesn't show the result.
>> 
>> $ uname -a
>> Linux Intel 6.9.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 26 May 2024 01:30:29 +0000 x86_64 GNU/Linux
>> 
>> $ mdadm --version
>> mdadm - v4.3 - 2024-02-15
>> 
>

---------------------------------------

Thank you Mariusz, that (--incremental --fail) worked:


# mdadm -If sda
mdadm: set sda faulty in md124
mdadm: set sda faulty in md125
mdadm: hot removed sda from md126

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md124 : active raid5 sdc[4] sdd[3] sdb[2] sde[0]
      2831155200 blocks super external:/md126/0 level 5, 128k chunk, algorithm 0 [4/3] [UU_U]
      [>....................]  recovery =  0.2% (2275456/943718400) finish=222.5min speed=70515K/sec

md125 : active raid5 sdc[4] sdd[3] sdb[2] sde[0]
      99116032 blocks super external:/md126/1 level 5, 128k chunk, algorithm 0 [4/3] [UU_U]
        resync=DELAYED

md126 : inactive sdc[4](S) sdb[2](S) sdd[1](S) sde[0](S)
      10585 blocks super external:imsm

unused devices: <none>
#
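
In case it's useful to anyone following along: something like this can watch
the rebuild and then block until recovery on both member arrays has finished
(mdadm --wait is the documented wait option; the 60-second interval is
arbitrary):

# watch -n 60 cat /proc/mdstat
# mdadm --wait /dev/md124 /dev/md125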


# journalctl -f
kernel: md/raid:md124: Disk failure on sda, disabling device.
kernel: md/raid:md124: Operation continuing on 3 devices.
kernel: md/raid:md125: Disk failure on sda, disabling device.
kernel: md/raid:md125: Operation continuing on 3 devices.
kernel: md: recovery of RAID array md124
kernel: md: delaying recovery of md125 until md124 has finished (they share one or more physical units)
mdadm[628]: mdadm: Fail event detected on md device /dev/md125, component device /dev/sda
mdadm[628]: mdadm: RebuildStarted event detected on md device /dev/md124
mdadm[628]: mdadm: Fail event detected on md device /dev/md124, component device /dev/sda

---------------------------------------

P.S. Belated thanks too for your solution to my previous problem here on 2021/08/02. That fix showed no sign of having succeeded until I rebooted, but after that all was fine.




