Re: Issue with moving LSI/Dell Raid to MD

On Mon, Mar 18, 2024 at 1:18 PM Mariusz Tkaczyk
<mariusz.tkaczyk@xxxxxxxxxxxxxxx> wrote:
>
> On Sat, 16 Mar 2024 20:26:15 +0200
> Shaya Potter <spotter@xxxxxxxxx> wrote:
>
> > note: not subscribed, so please cc me on responses.
> >
> > I recently had a Dell R710 die where I was using the Perc6 to provide
> > storage to the box.  As the box wasn't usable, I decided to image the
> > individual disks to a newer machine with significantly more storage.
> >
> > I sort of messed up the process, and in doing so I may have
> > discovered a bug in mdadm.
> >
> > Background, the Dell R710 supported 6 drives, which I had as a 1TB
> > SATA SSD and 5x8TB SATA disks in a RAID5 array.
> >
> > In the process of imaging it, I was setting up devices on /dev/loop
> > to prepare to assemble the raid, but I think I accidentally
> > assembled the raid while imaging the last disk (which in effect
> > caused the last disk to get out of sync with the other disks).
> > This was initially ok, until the VM I was doing it on crashed with
> > a KVM/QEMU failure (unsure what occurred).
> >
> > I was hoping it would be easy to bring up the raid array again,
> > but now mdadm was segfaulting on a null pointer dereference
> > whenever I tried to assemble the array (I was just trying the
> > RAID5 portion).
> >
> > I was thinking perhaps my VM had gotten corrupted, but I couldn't
> > figure that out, so I decided to try to reimage the disks (more
> > carefully this time); indeed, the 5th disk was marked as in quick
> > init, while the others were more consistent.
> >
> > However, the same segfault was occurring, so I built mdadm from
> > source (with -g and no -O; as an aside, this would be a good
> > Makefile target to have, to make issues easier to debug).
> >
> > After understanding the issue, the segfault seems to be due to
> > Assemble.c calling update_super() on a ddf supertype, except
> > super-ddf.c doesn't provide that function.
> >
> > i.e. in Assemble.c it was crashing at
> >
> > if (st->ss->update_super(st, &devices[j].i, UOPT_SPEC_ASSEMBLE,
> >                          NULL, c->verbose, 0, NULL)) {...}
> >
> > which explained the segfault on the null pointer dereference.  I
> > was able to get past the segfault (perhaps badly, but it "seems"
> > to work for me) by putting a null check before the update_super()
> > call, i.e.
> >
> > if (st->ss->update_super && st->ss->update_super(....)) { ... }
> >
> > Thoughts on my "fix"? (Perhaps super-ddf.c needs an empty
> > update_super() function instead?)  Is this a bug, or is it simply
> > unexpected that I got into this state in the first place?
> >
>
> Hello Shaya,
> DDF is not actively developed; I'm considering dropping it.
> If you are interested in bringing it back to life, you are
> more than welcome to send patches!
>
> If DDF doesn't implement update_super(), then the fix you proposed
> seems valid. Please send a proper patch and we will review it.
>
> Thanks,
> Mariusz

I'll make a proper patch in the coming days.

Just to note: DDF support is very useful for recovering RAID arrays
that carry that metadata.  It would be a shame (IMO) to lose it, as my
recovery/migration effort would have been much more difficult without
it.  At worst, I'd suggest marking it unmaintained and requiring a
specific flag to use it, with a note that, being unmaintained, it may
go down code paths that are untested and could break in future (i.e.
what happened to me).

As a totally separate aside: md performs much better (performance
wise) on loop devices when the loop devices are created with direct
I/O support enabled.
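For anyone repeating this kind of image-based recovery, losetup's --direct-io option is how that gets enabled; the backing file path below is just an illustrative example.

```shell
# Attach an image with direct I/O enabled (bypasses the page cache
# for the backing file); disk1.img is an illustrative path.
losetup --find --show --direct-io=on disk1.img

# Or toggle it on an already-attached loop device:
losetup --direct-io=on /dev/loop0

# Verify: the DIO column should read 1 for the device.
losetup --list --output NAME,BACK-FILE,DIO
```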




