BUG drivers/md/md.c: data-offset reshape renders array unloadable

"Wesley W. Terpstra" <wesley@xxxxxxxxxxx> · Sat, 24 Jan 2015 22:44:10 +0100

On Wed, Jan 21, 2015 at 1:37 PM, Wesley W. Terpstra <wesley@xxxxxxxxxxx> wrote:
> Try: mdadm -A -o --run --force --freeze-reshape /dev/md120 /dev/sd[a-d]4
> It failed the same as before. Logs in mdadm-a.log and dmesg.log

I have now tried 3.18.3, and had exactly the same problem.
To recap, for anyone new to this thread:

I have a raid5 array with 4 disks. I did a reshape to change the
data-offset from 5120 to 8192, changing nothing else. The number of
disks remained the same. The reshape was going fine and was probably
around 50% done when the system was rebooted. All the superblocks have
matching event counts, all checksums pass, and all disks have clean
bills of health. However, any attempt to reassemble the array fails.

I have gone ahead and applied a patch to md.c to diagnose why the
array does not assemble. The patch (against 3.18.3) is attached.
When I try to assemble the array using my patched 3.18.3, this is
what I see:

[    5.830139] md: FYI: 8192 old 8192 new
[    5.830167] md: sectors mismatch 5842894848 < 5842897920 on sdc4
[    5.830199] md: sdc4 does not have a valid v1.2 superblock, not importing!
[    5.830635] md: md_import_device returned -22

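First, it is clearly the last test in super_1_load that is rejecting
the array. The superblock reports more data sectors (5842897920) than
the kernel calculates as available on the device (5842894848), so the
check
        if (sectors < le64_to_cpu(sb->data_size)) {
evaluates true and the device is refused with -EINVAL (-22).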
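
For readers who want to see the comparison in isolation, here is a
minimal Python model of it, fed with the two numbers from the log
above. It is only a sketch, not kernel code: check_data_size is a
hypothetical name, and it assumes the left-hand logged value is the
calculated sector count and the right-hand one is sb->data_size.

```python
import errno

def check_data_size(available_sectors, sb_data_size):
    """Model of the final size test: refuse the device when the
    superblock claims more data sectors than were calculated."""
    if available_sectors < sb_data_size:
        return -errno.EINVAL  # the -22 seen from md_import_device
    return 0

# Values taken from the "sectors mismatch" log line.
result = check_data_size(5842894848, 5842897920)
print(result)
```

With equal arguments the same function returns 0, which is the case
the array would be in if the two values matched.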

IMO, it seems that the problem is that the superblock has had
rdev->data_offset updated prematurely! The reshape was incomplete, so
I would have expected data_offset to still be 5120 and new_data_offset
to be 8192. If data_offset HAD been 5120, the two values compared in
the failing inequality would be equal.
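
The arithmetic behind that claim can be checked directly. The sketch
below assumes the calculated value is the device size in sectors
minus data_offset, which is how I read the check; the variable names
are mine and only the four numbers come from this thread.

```python
OLD_OFFSET = 5120          # data_offset before the reshape
NEW_OFFSET = 8192          # data_offset the reshape was moving to
calculated = 5842894848    # left-hand value in the log
sb_data_size = 5842897920  # right-hand value in the log

# Assumption: calculated = device_sectors - data_offset, and the
# kernel used data_offset = 8192 (the "FYI: 8192 old 8192 new" line).
device_sectors = calculated + NEW_OFFSET

# What the same calculation would give with data_offset = 5120.
with_old_offset = device_sectors - OLD_OFFSET

shortfall = sb_data_size - calculated
print(shortfall, NEW_OFFSET - OLD_OFFSET)   # 3072 3072
print(with_old_offset == sb_data_size)      # True
```

The shortfall is exactly the difference between the two offsets, so
under this assumption a data_offset of 5120 would make both sides of
the inequality equal.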

I am considering simply modifying my superblock to reset data_offset
to 5120 and seeing whether the reshape then resumes correctly. Is this
the correct fix to recover my array? I imagine the source of my
problems must be a bug somewhere in the mdadm reshape code that
updates both new_data_offset and data_offset at the same time when
doing a data-offset reshape?

Attachment: debug.patch