Re: mdadm grow raid 5 to 6 failure (crash)

David Gilmour <dgilmour76@xxxxxxxxx> · Fri, 23 Jun 2023 13:17:00 -0600

I wanted to provide an update on this thread. First of all, thank you
for all the insights and recommendations. I finally found a way to
recover my data and wanted to pass along the fix in case someone
stumbles across this exact scenario. Summary below:
 - I believe there is some kind of problem with the kernel or a module
in 5.14.0-319.el9.x86_64 for my controller (ASMedia ASM1064 chipset),
which I believe was responsible for the drives attached to it
disappearing while my grow from raid 5 to raid 6 was taking place
 - After the above event (and rebooting), whenever I tried to assemble
the raid to resume the rebuild, mdadm would hang as previously
described in this thread.
 - After Yu pointed me to a patch that might have bypassed the issue, I
decided to first boot the system on a rescue disk with an older kernel
(3.x) and mdadm version
 - Fortunately, my assemble succeeded and the grow resumed and the
slow rebuild of my 30TB array completed 17 days later
 - My ASMedia ASM1064 chipset controller was 100% stable for the 17
days of rebuild on the old kernel
 - As soon as I went back to my 5.14.0-319.el9.x86_64 kernel my
ASMedia ASM1064 controller started showing ata timeout errors and
drives disappearing again
 - I ended up just purchasing another controller with a different
chipset (Marvell 88SE9215) out of desperation and the system is
finally stable and my data is all intact!
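
For anyone wanting to keep an eye on a long reshape like this one, here
is a minimal sketch of the two checks involved: reading the reshape
percentage from /proc/mdstat and counting ATA error lines in the kernel
log. The function names are hypothetical, the sed pattern assumes the
usual "reshape = N%" progress line in mdstat, and the error patterns are
an assumption about what the kernel prints on link trouble, so they are
not exhaustive.

```shell
#!/bin/sh
# Sketch only: helper names and patterns are illustrative assumptions.

# Print the reshape percentage from an mdstat-style file.
# Defaults to /proc/mdstat; a saved copy can be passed instead.
reshape_percent() {
    sed -n 's/.*reshape *= *\([0-9.]*\)%.*/\1/p' "${1:-/proc/mdstat}"
}

# Count kernel log lines that look like ATA link trouble.
# Reads a saved log file if one is given, otherwise the live ring buffer.
count_ata_errors() {
    if [ -n "${1:-}" ]; then
        cat -- "$1"
    else
        dmesg
    fi | grep -Ec 'ata[0-9]+(\.[0-9]+)?: .*(timeout|frozen|failed command|hard resetting link)'
}

# Example use on a live system (dmesg may require root):
#   reshape_percent
#   count_ata_errors
```

A count that keeps climbing during the reshape would point at the
controller or link rather than at md itself.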

Again, thank you everyone for the help!

--David

On Mon, May 8, 2023 at 8:33 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2023/05/09 6:53, Roger Heflin wrote:
> > On Mon, May 8, 2023 at 6:57 AM David Gilmour <dgilmour76@xxxxxxxxx> wrote:
> >>
> >> Ok, well I'm willing to try anything at this point. Do you need
> >> anything from me for a patch? Here is my current kernel details:
> >
> > grep -i mdadm /etc/udev/rules.d/* /lib/udev/rules.d/*
> >
> > If you can find a udev rule that starts up the monitor then move that
> > rule out of the directory, so that on the next assemble try it does
> > not get started.
> >
> > If this is the recent bug that is being discussed then anything
> > accessing the array after the reshape will deadlock the array and the
> > reshape.
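
The suggestion above (find the udev rules that mention mdadm, then move
them aside before the next assemble attempt) could be sketched roughly
as follows. The function names and the parking directory are
hypothetical; only the grep over the two rules directories comes from
the message above. Whether a matching rule actually starts the monitor
still has to be checked by reading it.

```shell
#!/bin/sh
# Sketch only: helper names and the parking directory are assumptions.

# List the files among the arguments that mention mdadm (case-insensitive).
list_mdadm_rules() {
    grep -il mdadm "$@" 2>/dev/null
}

# Move the given rule files into a holding directory so udev no longer
# sees them. First argument is the destination, the rest are files.
park_rules() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            mv -- "$f" "$dest"/ || return 1
        fi
    done
    return 0
}

# Example use (as root), after reviewing the listed files by hand:
#   list_mdadm_rules /etc/udev/rules.d/* /lib/udev/rules.d/*
#   park_rules /root/parked-udev-rules /lib/udev/rules.d/SOME-FILE.rules
```

Moving the files back into place restores the original behaviour once
the array has been assembled.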
>
> It's not anything accessing the array; in fact, only I/O across the
> reshape position can trigger the deadlock.
>
> I just posted a fix patch in the other thread that fails such I/O while
> the reshape can't make progress. However, I'm not sure yet whether this
> will break mdadm. For example, does mdadm need to read something from
> the array to make progress?
>
> Thanks,
> Kuai