Re: RAID5->RAID6 reshape remains stuck at 0% (does nothing, not even start)

antlists <antlists@xxxxxxxxxxxxxxx> · Wed, 30 Sep 2020 21:16:10 +0100

On 30/09/2020 20:45, David Madore wrote:
On Wed, Sep 30, 2020 at 08:03:32PM +0100, Wols Lists wrote:
On 30/09/20 19:58, David Madore wrote:
mdadm - v4.1 - 2018-10-01

- which I think is roughly contemporaneous to the kernel version I'm
using.  But the problem still persists (with the exact same symptoms
and details).

Except that mdadm is NOT the problem. The problem is that the kernel and
mdadm are not matched date-wise, and because the kernel is a
franken-kernel you need to use a different kernel.

I don't understand what you mean by "matched date-wise".  The kernel
I'm using is a longterm support branch (4.9) which was frozen at the
same approximate date as the mdadm I just installed.  And it was also
the same longterm support branch that was used in the Debian oldstable
(9 aka stretch).  Do you mean that there is no mdadm version which is
compatible with the 4.9 kernels?  How often does the mdadm-kernel
interface break compatibility?

The problem is that if you use mdadm 3.4 with kernel 4.9.237, the 237 
means that your kernel has been heavily updated and is far too new. But 
if you use mdadm 4.1 with kernel 4.9.237, the 4.9 means that the kernel 
is basically a very old one - too old for mdadm 4.1

As I said, the problem is the kernel - it is, at heart, an ancient 
kernel. And it hasn't been regression tested for raid reshapes. And what 
the problem is, we don't know exactly, nor do we particularly care, 
sorry. So long as your data isn't lost, the response here is pretty much 
the same as elsewhere, unfortunately - "run an up-to-date system".

Use a rescue disk!!! That way, you get a kernel and an mdadm that are
the same approximate date. As it stands, your frankenkernel is too new
for mdadm 3.4, but too ancient for a modern kernel.

Using a rescue disk would mean taking the system down for longer than
I can afford (I can afford to have this particular partition down for
a long time, but not the whole system...  which unfortunately resides
on the same disks).  So I'd like to keep this as a very last resort,
or at least, not consider it until I've fully understood what's going
on.  (It's especially problematic that I have absolutely no idea of
the speed at which I can expect the reshape to take place, compared to
an ordinary resync.  If you could give me a ballpark figure, it would
help me decide.  My disks resync at ~120MB/sec, and the RAID array I
wish to reshape is ~900GB in per partition, so it takes a few hours to
do an "ordinary" resync: I assume a reshape will take much longer, but
how much longer are we talking?)

What do you mean by a resync? Do you mean replacing a drive? Because I 
can't speak for certain, but I wouldn't expect a reshape to take much 
longer.

If you don't want to take the system down to use a rescue disk, I don't 
really know what to suggest. You could revert your kernel back to a 
4.9.x where x is a single digit, and it would probably work. Or you 
could install a modern 5.9 or similar kernel, but that might well break 
a load of other stuff. Or just upgrade to a new Debian/Ubuntu ... any of 
them *should* work, but the only options we'd recommend would be to 
upgrade your distro, or use a rescue disk. Sorry.

Cheers,
Wol