Re: mdadm grow raid 5 to 6 failure (crash)

I won't buy a Marvell.  I had some sort of 92xx variant (2 PCIe lanes)
and it had a bad habit of stopping working under load, taking all 4
ports offline.  The board maker blamed the driver, but the driver
is/was the generic AHCI one, so it is unlikely to be the issue, since
all the other ports were also AHCI and were just fine.

Your best bet is a used LSI SAS controller.  You can get 8 ports off
of one, but you may need a breakout cable to connect 4 SATA devices.

On Fri, Jun 23, 2023 at 2:17 PM David Gilmour <dgilmour76@xxxxxxxxx> wrote:
>
> I wanted to provide an update on this thread. First of all thank you
> for all the insights and recommendations. I finally found a way to
> recover my data and wanted to pass what the fix was in the event
> someone stumbles across this exact scenario. Summary below
>  - I believe there is some kind of problem with the kernel or a
> module in 5.14.0-319.el9.x86_64 for my controller (ASMedia ASM1064
> chipset), which I believe was responsible for the drives attached to
> it disappearing while my grow from raid 5 to raid 6 was taking place.
>  - After the above event (and rebooting), whenever I tried to
> assemble the raid to resume the rebuild, mdadm would hang as
> previously described in this thread.
>  - After Yu pointed me to a patch that might bypass the issue, I
> decided to first boot the system from a rescue disk with an older
> kernel (3.x) and mdadm version.
>  - Fortunately, my assemble succeeded (see the sketch after this
> list), the grow resumed, and the slow rebuild of my 30TB array
> completed 17 days later.
>  - My ASMedia ASM1064 chipset controller was 100% stable for the 17
> days of rebuild on the old kernel.
>  - As soon as I went back to my 5.14.0-319.el9.x86_64 kernel, my
> ASMedia ASM1064 controller started showing ata timeout errors and
> drives disappearing again.
>  - I ended up just purchasing another controller with a different
> chipset (Marvell 88SE9215) out of desperation, and the system is
> finally stable and my data is all intact!
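
(A minimal sketch of the rescue-environment assemble step described in
the list above; /dev/md0 and the member device names are placeholders,
not details confirmed from David's setup:)

    # From the rescue environment (older kernel + mdadm), re-assemble
    # the array; the interrupted raid5 -> raid6 reshape should resume
    # on its own once assembly succeeds.
    mdadm --assemble --verbose /dev/md0 /dev/sd[b-f]1

    # Watch the reshape progress:
    cat /proc/mdstat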
>
> Again thank you everyone for the help!
>
> --David
>
>
> On Mon, May 8, 2023 at 8:33 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On 2023/05/09 6:53, Roger Heflin wrote:
> > > On Mon, May 8, 2023 at 6:57 AM David Gilmour <dgilmour76@xxxxxxxxx> wrote:
> > >>
> > >> Ok, well I'm willing to try anything at this point. Do you need
> > >> anything from me for a patch? Here is my current kernel details:
> > >
> > > grep -i mdadm /etc/udev/rules.d/* /lib/udev/rules.d/*
> > >
> > > If you can find a udev rule that starts up the monitor, move that
> > > rule out of the directory so that it does not get started on the
> > > next assemble attempt.
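
(A minimal sketch of the udev check described above; the rule filename
shown is only an example of what the grep might report, not something
confirmed on this system:)

    # Look for udev rules that invoke mdadm (incremental assembly or
    # monitor startup):
    grep -il mdadm /etc/udev/rules.d/* /lib/udev/rules.d/*

    # Temporarily move a matching rule out of the rules directory so it
    # is not triggered on the next assemble attempt, then reload udev:
    mv /lib/udev/rules.d/64-md-raid-assembly.rules /root/
    udevadm control --reload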
> > >
> > > If this is the recent bug that is being discussed, then anything
> > > accessing the array after the reshape will deadlock both the array
> > > and the reshape.
> >
> > It's not anything accessing the array; in fact, only I/O across the
> > reshape position can trigger the deadlock.
> >
> > I just posted a fix patch in the other thread that fails such I/O
> > while the reshape can't make progress. However, I'm not sure yet
> > whether this will break mdadm; for example, does mdadm have to read
> > something from the array to make progress?
> >
> > Thanks,
> > Kuai



