My older system would reliably crash (total hardware reset, the crashdump did not dump, and the hardware would boot back up, all very quickly) when doing a check / resync. On one of the boot-ups I finally saw the "machine check" messages logged and decoded them. Mine came back as a PCIe error. I disassembled, vacuumed, and cleaned the connectors on the PCIe SAS card, and for good measure moved said PCIe card (an LSI SAS2008) to the one other x8 slot I had, just in case the slot itself was bad. Those changes made it reliable: full check once per week, up 11 weeks now, where the MTBF before was 2-3 weeks at most. As long as I did not run a resync/check, the machine was "perfectly" stable and would stay up. So remember that a resync puts a lot of stress on the power supply, uses the PCIe buses heavily, and loads a lot of other components.

This is what I had in messages when it was booting back up after the hardware crash/cycle:

Jun 14 04:25:05 rahrah kernel: [ 0.560897] mce: [Hardware Error]: Machine check events logged
Jun 14 04:25:05 rahrah kernel: [ 0.560897] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: f600000000070f0f

Also note that I have seen issues with 12-16TB disks (these are OEM disks, so it is unclear who really made them, but it could very well be Seagate) where they respond massively slowly. With the timeouts set high enough they do respond, but they are so slow that they cannot keep up with the load the application needs. The fix that has been used is to find the troublemaker disk (it will show up in various I/O tools as very busy/slow, it responds slowly to smartctl commands, and it often has a non-zero and rising bad sector count) and replace it. Where I have experience with these disks we have a lot of them, and replacements found with this process have been solving the issues. I would certainly make sure that if you cannot set scterc, you set the SCSI timeouts high enough. The 12-16TB ones here aren't behind mdadm but behind a hardware RAID controller, and even that controller seems to have a lot of trouble dealing with the slow disks.

Rough command sketches for the MCE decode, the weekly check, finding the slow disk, and the timeout settings follow below.
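For completeness, this is roughly how I decode such an entry; a minimal sketch, assuming the mcelog (or rasdaemon) userspace tools are installed and the logged lines are saved to a file (the file name is just an example):

  # feed the raw "mce: [Hardware Error]" lines from the kernel log to mcelog
  # for decoding (file name is an example)
  mcelog --ascii < /tmp/mce-from-messages.txt

  # or, if rasdaemon is running, it keeps a decoded history of MCEs
  ras-mc-ctl --errors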
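The weekly full check is nothing special, just the standard md sysfs interface driven from cron; a sketch, with md127 as a placeholder device name:

  # start a check and watch progress
  echo check > /sys/block/md127/md/sync_action
  cat /proc/mdstat

  # after it completes, a non-zero mismatch count is worth investigating
  cat /sys/block/md127/md/mismatch_cnt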
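For finding the troublemaker disk, nothing exotic; roughly this (iostat is from the sysstat package, /dev/sdX is a placeholder):

  # a struggling member stands out with near-100% utilisation and much
  # higher await than its siblings under the same load
  iostat -x 5

  # then check the suspect for reallocated / pending sector counts
  smartctl -A /dev/sdX | grep -iE 'reallocat|pending|uncorrect'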
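And the scterc / SCSI timeout part, roughly; the 7-second ERC value and the 180-second timeout are just the values I tend to use, not gospel:

  # if the drive supports it, cap error recovery to 7 seconds (70 deciseconds)
  smartctl -l scterc,70,70 /dev/sdX

  # if scterc cannot be set, raise the kernel's SCSI command timeout well
  # above the drive's internal recovery time instead
  echo 180 > /sys/block/sdX/device/timeout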
On Tue, Sep 7, 2021 at 4:20 AM Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>
> On Tue, 7 Sep 2021 12:52:01 +0500
> Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>
> > On Mon, 6 Sep 2021 20:44:31 -0400
> > Ryan Patterson <ryan.goat@xxxxxxxxx> wrote:
> >
> > > My file server is usually very stable. The past week I had two mdadm
> > > arrays that required resync operations.
> > > * newly created raid6 array (14 x 16TB seagate exos)
> > > * existing raid 6 array, after a reboot resync on hot spare (14 x 4TB
> > > seagate barracuda)
> > >
> > > During both resync operations (they ran one at a time) the system
> > > would routinely experience a major error and require a hard reboot,
> > > every two or three hours. I saw several errors, such as:
> > > * kernel watchdog soft lockups [md127_raid6:364]
> > > * general protection faults (I have a few saved with the full exception stack)
> > > * exceptions in iommu routines (again I have the full error with
> > > exception stack saved)
> > > * full system lockup
> >
> > So in other words the server is very stable, unless asked to do full-speed
> > reads from all disks at the same time.
> >
> > I'd suggest to check or improve cooling on the HBA cards, and then try a
> > different PSU.
>
> Also the motherboard chipset cooling, since that's a lot of PCI-E traffic.
> Maybe the CPU cooling as well, or at least check the CPU temperatures during
> this load.
>
> And since you have full logs and backtraces, there's no point in waiting to
> post those, just go ahead. Maybe they will point to something other than
> suspect hardware, or at least to which part of hardware to suspect.
>
> --
> With respect,
> Roman