Re: mdadm resync causes stable system to crash every 2 or 3 hours

Ryan Patterson <ryan.goat@xxxxxxxxx> · Sat, 15 Jan 2022 10:46:52 -0500

On Tue, Sep 7, 2021 at 6:55 PM Ryan Patterson <ryan.goat@xxxxxxxxx> wrote:
>
> On Tue, Sep 7, 2021 at 5:18 AM Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> >
> > On Tue, 7 Sep 2021 12:52:01 +0500
> > Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> >
> > > On Mon, 6 Sep 2021 20:44:31 -0400
> > > Ryan Patterson <ryan.goat@xxxxxxxxx> wrote:
> > >
> > > > My file server is usually very stable.  The past week I had two mdadm
> > > > arrays that required resync operations.
> > > > * newly created raid6 array (14 x 16TB seagate exos)
> > > > * existing raid 6 array, after a reboot resync on hot spare (14 x 4TB
> > > > seagate barracuda)
> > > >
> > > > During both resync operations (they ran one at a time) the system
> > > > would routinely experience a major error and require a hard reboot,
> > > > every two or three hours.  I saw several errors, such as:
> > > > * kernel watchdog soft lockups [md127_raid6:364]
> > > > * general protection faults (I have a few saved with the full exception stack)
> > > > * exceptions in iommu routines (again I have the full error with
> > > > exception stack saved)
> > > > * full system lockup
> > >
> > > So in other words the server is very stable, unless asked to do full-speed
> > > reads from all disks at the same time.
> > >
> > > I'd suggest to check or improve cooling on the HBA cards, and then try a
> > > different PSU.
> >
> > Also the motherboard chipset cooling, since that's a lot of PCI-E traffic.
> > Maybe the CPU cooling as well, or at least check the CPU temperatures during
> > this load.
> >
> > And since you have full logs and backtraces, there's no point in waiting to
> > post those, just go ahead. Maybe they will point to something other than
> > suspect hardware, or at least to which part of hardware to suspect.
> >
> > --
> > With respect,
> > Roman
>
> Thanks for the suggestions.  Hardware overheating might be my problem.
> I have several (loud) case fans blowing away.  But the HBA cards and
> mobo southbridge are only passively cooled.  Maybe I could mount fans
> on each card's heatsink.  I'll investigate.
>
> The power supply is not an off the shelf job.  So I don't conveniently
> have a replacement to try.  I might have to bite the bullet and buy a
> second.
>
> I forgot to put in my original note that I ran memtest86 on this
> machine for four full cycles with no faults found.  Also nothing is
> overclocked.
>
> Here are some of the errors I recorded.  Maybe somebody can see a
> pattern in them...
>
> [snip]

Just to provide closure on the system instability I reported four
months ago: at the end of 2021 I replaced the motherboard, CPU, & RAM
with "server grade" hardware (Intel Xeon).  Since the upgrade, the
system has been 100% stable, even during sustained mdadm I/O
workloads.

So it appears the consumer grade motherboard was at fault all along.
Thanks for the help troubleshooting the issue.
_____________
Ryan Patterson
May the wings of liberty never lose a feather.