Re: Raid-6 won't boot

If you did not use an external backup file, you are better off; the
external file is only required in a subset of cases.  I think there is
a way to restart a reshape, and someone here should know how to do that
now that we can at least find your raid and see it failing to activate.
You may want to start a new thread on how to resume a reshape after it
aborted (with a summary of the original problem and the error you are
now getting).  You may also want to use the newest Fedora livecd you
can find, since the original issue may have been a bug in your old
kernel.  If you can get the reshape going I would let it finish on that
livecd so that the old system does not have to do a reshape with what
may be a buggy kernel.
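
As a rough sketch only (check the mdadm man page on whatever livecd you
boot, and substitute your real array and member names for the
/dev/md126 and /dev/sd[a-g]3 placeholders here), resuming usually looks
something like:

  mdadm --stop /dev/md126
  mdadm --assemble /dev/md126 --invalid-backup /dev/sd[a-g]3
  cat /proc/mdstat     # the reshape should reappear and start moving

--invalid-backup is the option for exactly the "Failed to restore
critical section for reshape" situation, i.e. assembling when the
backup of the critical section is missing; if you did use a backup file
originally, give --backup-file= pointing at it instead.  The man page
has the exact rules for which of the two you need.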

I have also typically avoided the rescue cd's and stayed with full
livecd's because of the limited tool sets and functionality on the
dedicated rescue ones.  Usually I pick a random Fedora livecd to use as
a rescue disk, and in general that has worked very well across a wide
variety of ancient OS'es (ancient compared to the really new Fedora
livecd).
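
Whichever disk you boot, it is worth making sure it actually has the
raid6 personality available before trying anything, since that was the
problem on the rescue cd; roughly:

  lsmod | grep raid       # look for raid456
  modprobe raid456        # load it by hand if it is not there
  cat /proc/mdstat        # Personalities line should now include raid6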

On Tue, Mar 31, 2020 at 11:20 AM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
>
> Yes, I had added a drive and it was busy copying data to the new drive
> when the reshape slowed down gradually, and eventually the system locked
> up.  I didn't change raid configurations or anything like that - just
> added a drive.  I didn't use any external files, so not sure if I'd be
> able to recover any... I suspect not...
>
> thanks,
> allie
>
> On 3/31/2020 5:16 PM, Roger Heflin wrote:
> > Were you doing a reshape when it was rebooted?  And if so, did you
> > have to use an external file when doing the reshape, and where was
> > that file?  I think there is a command to restart a reshape using an
> > external file.
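> > (From memory that is something along the lines of
> >   mdadm --assemble /dev/mdX --backup-file=/path/to/backup-file
> > with the member devices listed after it, but check the man page - I
> > do not remember the exact form.)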
> >
> > On Tue, Mar 31, 2020 at 11:13 AM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
> >>
> >> Quick followup: trying a stop and assemble results in the message
> >> "Failed to restore critical section for reshape, sorry".
> >>
> >> On 3/31/2020 11:08 AM, Alexander Shenkin wrote:
> >>> Thanks Roger,
> >>>
> >>> It seems only the Raid1 module is loaded.  I didn't find a
> >>> straightforward way to get that module loaded... any suggestions?  Or,
> >>> will I have to find another livecd that contains raid456?
> >>>
> >>> Thanks,
> >>> Allie
> >>>
> >>> On 3/30/2020 9:45 PM, Roger Heflin wrote:
> >>>> They all seem to be there, all seem to report all 7 disks active, so
> >>>> it does not appear to be degraded. All event counters are the same.
> >>>> Something has to be causing them to not be scanned and assembled at
> >>>> all.
> >>>>
> >>>> Is the rescue disk a similar OS to what you have installed?  If it is,
> >>>> you might try, say, a random Fedora livecd and see if it acts any
> >>>> different.
> >>>>
> >>>> what does fdisk -l /dev/sda look like?
> >>>>
> >>>> Is the raid456 module loaded (lsmod | grep raid)?
> >>>>
> >>>> what does cat /proc/cmdline look like?
> >>>>
> >>>> you might also run this:
> >>>> file -s /dev/sd*3
> >>>> But I think it is going to show us the same thing as what the mdadm
> >>>> --examine is reporting.
> >>>>
> >>>> On Mon, Mar 30, 2020 at 3:05 PM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
> >>>>>
> >>>>> See attached.  I should mention that the last drive I added is on a new
> >>>>> controller that is separate from the other drives, but it seemed to work
> >>>>> fine for a bit, so I kinda doubt that's the issue...
> >>>>>
> >>>>> thanks,
> >>>>>
> >>>>> allie
> >>>>>
> >>>>> On 3/30/2020 6:21 PM, Roger Heflin wrote:
> >>>>>> Do this against each partition that had it:
> >>>>>>
> >>>>>>  mdadm --examine /dev/sd***
> >>>>>>
> >>>>>> It seems like it is not seeing it as an md-raid.
> >>>>>>
> >>>>>> On Mon, Mar 30, 2020 at 11:13 AM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
> >>>>>>> Thanks Roger,
> >>>>>>>
> >>>>>>> The only line that isn't commented out in /etc/mdadm.conf is "DEVICE
> >>>>>>> partitions"...
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Allie
> >>>>>>>
> >>>>>>> On 3/30/2020 4:53 PM, Roger Heflin wrote:
> >>>>>>>> That seems really odd.  Is the raid456 module loaded?
> >>>>>>>>
> >>>>>>>> On mine I see messages like this for each disk it scanned and
> >>>>>>>> considered as possibly being an array member.
> >>>>>>>>  kernel: [   83.468700] md/raid:md13: device sdi3 operational as raid disk 5
> >>>>>>>> and messages like this:
> >>>>>>>>  md/raid:md14: not clean -- starting background reconstruction
> >>>>>>>>
> >>>>>>>> You might look at /etc/mdadm.conf on the rescue cd and see if it has a
> >>>>>>>> DEVICE line that limits what is being scanned.
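> >>>>>>>> (For example, a line like "DEVICE /dev/sda* /dev/sdb*" would keep the
> >>>>>>>> other disks from ever being considered, while "DEVICE partitions"
> >>>>>>>> means scan everything listed in /proc/partitions.)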
> >>>>>>>>
> >>>>>>>> On Mon, Mar 30, 2020 at 10:13 AM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
> >>>>>>>>> Thanks Roger,
> >>>>>>>>>
> >>>>>>>>> That grep just returns the detection of the raid1 (md127).  See dmesg
> >>>>>>>>> and mdadm --detail results attached.
> >>>>>>>>>
> >>>>>>>>> Many thanks,
> >>>>>>>>> allie
> >>>>>>>>>
> >>>>>>>>> On 3/28/2020 1:36 PM, Roger Heflin wrote:
> >>>>>>>>>> Try this grep:
> >>>>>>>>>> dmesg | grep "md/raid" - if that returns nothing, you can just send
> >>>>>>>>>> the entire dmesg.
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Mar 28, 2020 at 2:47 AM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
> >>>>>>>>>>> Thanks Roger.  dmesg has nothing in it referring to md126 or md127....
> >>>>>>>>>>> any other thoughts on how to investigate?
> >>>>>>>>>>>
> >>>>>>>>>>> thanks,
> >>>>>>>>>>> allie
> >>>>>>>>>>>
> >>>>>>>>>>> On 3/27/2020 3:55 PM, Roger Heflin wrote:
> >>>>>>>>>>>> A non-assembled array always reports raid1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would run "dmesg | grep md126" to start with and see what it reports.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Mar 27, 2020 at 10:29 AM Alexander Shenkin <al@xxxxxxxxxxx> wrote:
> >>>>>>>>>>>>> Thanks Wol,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Booting in SystemRescueCD and looking in /proc/mdstat, two arrays are
> >>>>>>>>>>>>> reported.  The first (md126) is reported as inactive with all 7 disks
> >>>>>>>>>>>>> listed as spares.  The second (md127) is reported as active
> >>>>>>>>>>>>> auto-read-only with all 7 disks operational.  Also, the only
> >>>>>>>>>>>>> "personality" reported is Raid1.  I could go ahead with your suggestion
> >>>>>>>>>>>>> of mdadm --stop array and then mdadm --assemble, but I thought the
> >>>>>>>>>>>>> reporting of just the Raid1 personality was a bit strange, so wanted to
> >>>>>>>>>>>>> check in before doing that...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Allie
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 3/26/2020 10:00 PM, antlists wrote:
> >>>>>>>>>>>>>> On 26/03/2020 17:07, Alexander Shenkin wrote:
> >>>>>>>>>>>>>>> I surely need to boot with a rescue disk of some sort, but from there,
> >>>>>>>>>>>>>>> I'm not sure exactly what I should do.  Any suggestions are very welcome!
> >>>>>>>>>>>>>> Okay. Find a liveCD that supports raid (hopefully something like
> >>>>>>>>>>>>>> SystemRescueCD). Make sure it has a very recent kernel and the latest
> >>>>>>>>>>>>>> mdadm.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> All being well, the resync will restart, and when it's finished your
> >>>>>>>>>>>>>> system will be fine. If it doesn't restart on its own, do an "mdadm
> >>>>>>>>>>>>>> --stop array", followed by an "mdadm --assemble"
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If that doesn't work, then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Wol


