Reshape corrupting data after growing raid10

Hello everyone,
I have been shifting my raid10 setup around a little. Originally the
array had 6 devices (each around 3 TB) in layout near=2 with 512K
chunks. I then added two spares and ran mdadm --grow --raid-devices=8
- that took a few days but completed successfully, and everything ran
fine afterwards.
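
For reference, that first grow went roughly like this (/dev/md0 and
the disk names are placeholders, not my exact devices):

# mdadm /dev/md0 --add /dev/sdg /dev/sdh
# mdadm --grow /dev/md0 --raid-devices=8
# cat /proc/mdstat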

Last night I tried the same thing again - I added two devices as
spares and then wanted to grow to 10 devices. This time, though, I
figured the array was getting large enough to benefit from a larger
chunk size, and I also wanted to change the layout from near=2 to
offset=2. So I ran:

# mdadm --grow --raid-devices=10 --layout=o2 --chunk=1024K

This had a number of really weird effects. First of all, I
immediately got email notifications that four devices in the array
had failed, one right after another, all within about 5 seconds.
Luckily (?) the array as a whole did not fail - no two of the
failures were on consecutive devices - so the reshape started
nevertheless.
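
At that point the per-device state can be inspected with something
like this (/dev/md0 again being a placeholder for my array):

# mdadm --detail /dev/md0
# cat /proc/mdstat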

Worried, I checked /proc/mdstat and saw the progress fluctuating
between 0.0% and 99.9%. In dmesg I saw a lot of messages about
attempts to access the drives beyond the bounds of their actual size,
which kept restarting the array (and indeed, accessing data on the md
device was extremely slow). Sadly I don't have these exact logs to
share, because the situation soon deteriorated further: mdadm
commands stopped working entirely, every invocation exiting
immediately with a segmentation fault.

I immediately froze the reshape via sync_action, remounted / as
read-only and used dd_rescue to copy the (still running) system to a
backup. When I examined this backup on a remote machine, the ext4
filesystem turned out to be corrupted. I managed to repair it using a
backup superblock, but many inodes were corrupt and had to be
deleted, so something definitely went terribly wrong.
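
For the record, the freeze and copy were roughly these commands (the
md name, the root LV name and the target path are placeholders):

# echo frozen > /sys/block/md0/md/sync_action
# mount -o remount,ro /
# dd_rescue /dev/mapper/vg0-root /mnt/backup/root.img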

The system eventually locked up and I had to reboot. I booted from an
Ubuntu live disk and assembled the array there. The reshape
immediately resumed, this time without the hitches, 99.9%
fluctuations, weird dmesg entries etc. - but I froze it again shortly
afterwards because I was afraid it might cause further corruption or
destruction of data.
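
The live-disk steps were essentially these (md127 being the name a
live environment will typically assign; a placeholder here):

# mdadm --assemble --scan
# cat /proc/mdstat
# echo frozen > /sys/block/md127/md/sync_action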

For more context: my stack is raid -> luks -> lvm, so the array
contains one large LUKS-encrypted volume. LUKS reports no errors when
I unlock the device with my passphrase - that part still works fine.
I don't know whether that is simply because the LUKS header happens
to reside in the first 512K of the device, which I guess is the only
chunk that would not get moved around during the reshape.
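
Concretely, both of these still run without complaint (/dev/md0 and
the mapping name are placeholders):

# cryptsetup luksDump /dev/md0
# cryptsetup luksOpen /dev/md0 cryptbackup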

I'm currently using ddrescue to make a backup of the data in its
current state. I expect most of it is still recoverable, and I have
certainly learned my lesson: just because one reshape worked fine
doesn't mean I can skip an up-to-date backup before the next --grow.
What's done is done.
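
The rescue copy itself is just (paths are placeholders):

# ddrescue /dev/md0 /mnt/backup/md0.img /mnt/backup/md0.map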

Still, I wonder how and why this even happened. Was I wrong to assume
that when I issue a reshape to increase raid-devices, I can change
the layout and chunk size at the same time? Is the layout change from
near to offset not a stable feature of mdadm (I thought it had been
for a while now)? Or is it possible that if I let the reshape finish,
it will eventually fix the errors again? My assumption was that a
reshape never affects the data as seen from the virtual md device -
and therefore not the LUKS-encrypted volume either - so if my data
changed during the reshape, then clearly something happened that
absolutely should not have. Is this user error, or is it a bug in
mdadm?
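
If combining all three changes in one command turns out to be the
problem, I suppose the conservative approach would have been one
change per reshape, letting each finish before starting the next,
along the lines of (/dev/md0 again a placeholder):

# mdadm --grow /dev/md0 --raid-devices=10
# mdadm --grow /dev/md0 --layout=o2
# mdadm --grow /dev/md0 --chunk=1024

But I don't know whether that is actually required or merely safer.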


