Re: hung grow

NeilBrown <neilb@xxxxxxxx> · Wed, 11 Oct 2017 07:47:02 +1100

On Tue, Oct 10 2017, Curt wrote:

>>
>> Just --freeze-reshape, not --update.
>>
> Ok, here's the output
> mdadm --detail /dev/md127
> /dev/md127:
>            Version : 0.91
>      Creation Time : Fri Jun 15 15:52:05 2012
>         Raid Level : raid6
>         Array Size : 9767519360 (9315.03 GiB 10001.94 GB)
>      Used Dev Size : 1953503872 (1863.01 GiB 2000.39 GB)
>       Raid Devices : 8
>      Total Devices : 6
>    Preferred Minor : 127
>        Persistence : Superblock is persistent
>
>        Update Time : Tue Oct 10 15:11:26 2017
>              State : clean, FAILED, reshaping
>     Active Devices : 5
>    Working Devices : 6
>     Failed Devices : 0
>      Spare Devices : 1
>
>             Layout : left-symmetric
>         Chunk Size : 64K
>
> Consistency Policy : unknown
>
>     Reshape Status : 0% complete
>      Delta Devices : 1, (7->8)
>
>               UUID : 714a612d:9bd35197:36c91ae3:c168144d
>             Events : 0.11559682
>
>     Number   Major   Minor   RaidDevice State
>        0       8       97        0      active sync   /dev/sdg1
>        1       8       49        1      active sync   /dev/sdd1
>        2       8       33        2      active sync   /dev/sdc1
>        3       8        1        3      active sync   /dev/sda1
>        4      65      145        4      active sync   /dev/sdz1
>        -       0        0        5      removed
>        6       8       16        6      spare rebuilding   /dev/sdb
>        -       0        0        7      removed
>
> But in my dmesg I'm seeing task md127_reshape blocked for 120
> seconds, and when I cat sync_action it shows reshape.  Shouldn't
> it be frozen or something like that?  Also, the md127_raid6 task is
> using 100% CPU, while the reshape task isn't using any.  I was going
> to paste the assemble output, but hit clear instead of copy.  It
> didn't show any errors that I saw; it just started with 6 drives.
>
> If I do a cat of /proc/pid/stack, all I get is
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Should I just let it run?

Clearly a kernel bug.
What kernel are you using?  Can you try a newer one easily?

Can you please
  mkdir /tmp/dump
  mdadm --dump=/tmp/dump /dev...list.all.devices.in.the.array
  tar --sparse -czf /tmp/dump.tgz /tmp/dump

and send me /tmp/dump.tgz.  It will only contain the metadata.
I can then create an identical looking array and experiment.
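
For anyone following along without the real devices to hand, here is a
rough sketch of why the --sparse flag matters for the tar step.  It does
not run mdadm at all: it fakes one dump file as a mostly-empty sparse file
(the name sdX1, the 64M size and the "fake-superblock" payload are all
made up for illustration) and assumes GNU tar and GNU coreutils.

```shell
#!/bin/sh
# Sketch only: a stand-in for one file that a metadata dump might produce,
# i.e. a large file that is almost entirely holes.
set -eu

work=$(mktemp -d)
mkdir "$work/dump"

# 64 MiB apparent size, no data written yet (hypothetical file name).
truncate -s 64M "$work/dump/sdX1"

# Drop a few bytes of pretend metadata near the end without truncating.
printf 'fake-superblock' |
  dd of="$work/dump/sdX1" bs=4096 seek=16000 conv=notrunc 2>/dev/null

# Options first, archive name directly after -f.  With "tar czf --sparse ..."
# the archive itself would end up being named "--sparse".
tar --sparse -czf "$work/dump.tgz" -C "$work" dump

apparent=$(stat -c %s "$work/dump/sdX1")
archive=$(stat -c %s "$work/dump.tgz")
echo "apparent size: $apparent bytes, archive: $archive bytes"
```

The archive should come out far smaller than the apparent file size, which
is what makes it practical to mail around.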

I doubt that letting it run will help.

NeilBrown