Hello everyone,

I have been shifting around my raid10 design a little. Originally it had 6 devices (each around 3 TB) in layout near=2, with 512K chunks. I then added two spares and did mdadm --grow --raid-devices=8 - that took a few days, but was ultimately successful and everything ran fine afterwards.

Last night I tried the same thing again: I added two devices as spares and wanted to grow to 10 devices. This time, though, I figured the array was getting large enough to benefit from a larger chunk size, and I also wanted to change the layout from near=2 to offset=2. So I did:

# mdadm --grow --raid-devices=10 --layout=o2 --chunk=1024K

This had a number of really weird effects. First of all, I immediately got email notifications that four devices in the raid had failed, one right after the other, within about 5 seconds. Luckily (?) the raid did not fail - no two failures were on consecutive devices - so the reshape started nevertheless. Worried, I checked /proc/mdstat and saw the progress fluctuating between 0.0% and 99.9%. In dmesg I saw a lot of messages about attempts to access the drives beyond the bounds of their actual size, which kept restarting the array (and indeed, attempts to access data on the md device were extremely slow). Sadly I don't have these exact logs to share here, since the situation eventually deteriorated even further: I noticed that I couldn't run mdadm commands anymore - they immediately exited with a segmentation fault.

I immediately froze the reshape via sync_action, remounted / as read-only and used dd_rescue to copy the (still running) system to a backup. Examining this backup on a remote machine, the ext4 filesystem was corrupted. I managed to restore it via a backup of the superblock, but many inodes were corrupt and had to be deleted, so something definitely went terribly wrong. The system eventually locked up and I had to reboot it.

I booted from an Ubuntu live disk and assembled the array there. The reshape immediately restarted, this time without the hitches - no 99.9% fluctuations, no weird dmesg entries, etc. - but I froze the reshape again shortly after, because I'm afraid it might lead to more corruption/destruction of data.

For more info: I run raid -> luks -> lvm, so the raid contains one large luks-encrypted volume. There are no errors from luks when unlocking the device with my password - everything works fine so far. I don't know if that is because the luks header perhaps resides in the first 512K of the device, which I guess is the only block that would not get moved around during the reshape.

I'm currently in the process of using ddrescue to make a backup of the data in its current state. I guess most of it is still recoverable, and I will definitely learn from this that just because one reshape worked fine, I should still make an up-to-date backup before any --grow. What's done is done.

Still, I wonder how and why this even happened. Was I wrong to assume that when I issue a reshape by increasing raid-devices, I can change layout and chunk at the same time? Is the layout change from near to offset not a stable feature of mdadm (I thought it had been for a while now)? Or is it possible that if I let the reshape finish, it will eventually fix the errors again?
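For reference, the freeze-and-backup steps I took (and am repeating now) looked roughly like this - a sketch from memory, with /dev/md0 and the backup paths as placeholders rather than my exact device names:

# echo frozen > /sys/block/md0/md/sync_action    (freeze the reshape so md makes no further progress)
# mount -o remount,ro /                          (stop writes from the running system)
# ddrescue -d /dev/md0 /mnt/usb/md0.img /mnt/usb/md0.map    (raw image plus a mapfile so the copy can resume)

The superblock repair on the copied filesystem was along the lines of e2fsck -b 32768 <device>; the exact backup superblock location depends on the filesystem's block size.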
My assumption was that a reshape never affects the data as seen from the virtual md device, and therefore also not the data on the luks-encrypted volume. So if my data did change during the reshape, then clearly something happened that absolutely should not have happened. Is this a user error, or is this a bug in mdadm?
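In case it helps with diagnosing this, the state I can still query from the live environment is basically the following (again, /dev/md0 and the member partitions are placeholders for my actual names); I'm happy to post the output if that's useful:

# cat /proc/mdstat
# mdadm --detail /dev/md0
# mdadm --examine /dev/sd[b-k]1

As far as I understand, --examine also reports each member's reshape position, so it should at least show whether the devices agree on how far the reshape actually got.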