raid5 reshape stuck at 33%

Hello,

I started out with 8 drives, then added 4 more and grew the array in a
single operation, using mdadm 3.4. The reshape got stuck at 33%, and
errors showed up in dmesg about tasks blocked for more than 120
seconds, a tainted kernel, call traces, etc. I tried rebooting, but
the reshape won't progress any further. If I freeze the reshape I can
mount and read the data; otherwise mdadm commands stop responding and
sit at 100% CPU usage.
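
For completeness, this is roughly how I've been freezing it (going
from memory, so treat the exact invocations as approximate):

    # assemble with the reshape paused so the filesystem can be mounted
    mdadm --assemble --freeze-reshape /dev/md127 /dev/sd[a-l]1

    # or, on an already-running array, stop the md sync thread
    echo frozen > /sys/block/md127/md/sync_action

    # confirm the reshape counter is no longer moving
    cat /proc/mdstat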

I also tried booting from the latest Debian live image with mdadm 4.1,
but the reshape still won't progress past 33%. Suspecting drive issues
(failed SMART tests, badblocks errors), I physically removed one drive
at a time and force-assembled the array to see whether the reshape
would progress. I did this twice, each time with a different drive,
which I now think was a bad idea, because after a --re-add the array
treats the 12th drive as a spare.
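
The force-assemble and re-add steps looked something like this (sdX1
stands for whichever drive I had pulled at the time; I no longer have
the exact command history, so this is approximate):

    # stop the array, then assemble without the suspect drive
    mdadm --stop /dev/md127
    mdadm --assemble --force /dev/md127 /dev/sd[a-l]1

    # later, put the pulled drive back
    mdadm /dev/md127 --re-add /dev/sdX1

    # the returned drive now shows up as a spare here
    mdadm --detail /dev/md127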

Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdc1[1] sdi1[9](S) sdd1[11] sde1[12] sdk1[13] sdl1[14] sdj1[10] sdf1[8] sdg1[6] sdb1[5] sdh1[4] sda1[3]
      27348203008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [12/11] [_UUUUUUUUUUU]
      bitmap: 5/30 pages [20KB], 65536KB chunk

/dev/md127:
           Version : 1.2
     Creation Time : Fri Mar  4 02:28:46 2016
        Raid Level : raid5
        Array Size : 27348203008 (26081.28 GiB 28004.56 GB)
     Used Dev Size : 3906886144 (3725.90 GiB 4000.65 GB)
      Raid Devices : 12
     Total Devices : 12
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Nov 27 23:10:08 2019
             State : clean, degraded
    Active Devices : 11
   Working Devices : 12
    Failed Devices : 0
     Spare Devices : 1

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

     Delta Devices : 4, (8->12)

              Name : debian:one  (local to host debian)
              UUID : a6659be9:4545dfa0:678228ad:294eede4
            Events : 2022146

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       33        1      active sync   /dev/sdc1
       3       8        1        2      active sync   /dev/sda1
       4       8      113        3      active sync   /dev/sdh1
       5       8       17        4      active sync   /dev/sdb1
       6       8       97        5      active sync   /dev/sdg1
       8       8       81        6      active sync   /dev/sdf1
      10       8      145        7      active sync   /dev/sdj1
      14       8      177        8      active sync   /dev/sdl1
      13       8      161        9      active sync   /dev/sdk1
      12       8       65       10      active sync   /dev/sde1
      11       8       49       11      active sync   /dev/sdd1

       9       8      129        -      spare   /dev/sdi1

The reshape is still frozen, the volume groups are mounted, and I can
read the data. I don't remember exactly when I tried the
revert-reshape option, but it failed with an error saying something
like the reshape is not aligned and to stop and assemble again. I'm
not sure whether that was while the array had 12 active devices or 11.
Is it still possible to get back to the state where all 12 devices
(the original 8 plus the 4 new ones) are active, and then revert the
reshape successfully?
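
If it helps, the revert attempt was something along these lines (again
from memory, so the exact form may be off):

    # stop the array and try to assemble with the reshape reverted
    mdadm --stop /dev/md127
    mdadm --assemble --update=revert-reshape /dev/md127 /dev/sd[a-l]1

    # since the backup file is gone, I gather this may also need
    # --invalid-backup, but I'm not certain I used it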

The initial grow command specified a backup file on a USB drive, but I
can't find the file now. I assume mdadm deleted it intentionally.
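
For reference, the grow was something like this (the device names of
the four new drives and the backup file path are placeholders; I don't
remember the real path):

    # add the four new drives as spares
    mdadm /dev/md127 --add /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1

    # grow from 8 to 12 devices, with the backup file on the USB drive
    mdadm --grow /dev/md127 --raid-devices=12 \
          --backup-file=/media/usb/md127-grow.backup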

Thanks for any help.


