On Tue, Sep 7, 2021 at 6:55 PM Ryan Patterson <ryan.goat@xxxxxxxxx> wrote:
>
> On Tue, Sep 7, 2021 at 5:18 AM Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> >
> > On Tue, 7 Sep 2021 12:52:01 +0500
> > Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> >
> > > On Mon, 6 Sep 2021 20:44:31 -0400
> > > Ryan Patterson <ryan.goat@xxxxxxxxx> wrote:
> > >
> > > > My file server is usually very stable. The past week I had two mdadm
> > > > arrays that required resync operations.
> > > > * newly created raid6 array (14 x 16TB seagate exos)
> > > > * existing raid 6 array, after a reboot resync on hot spare (14 x 4TB
> > > >   seagate barracuda)
> > > >
> > > > During both resync operations (they ran one at a time) the system
> > > > would routinely experience a major error and require a hard reboot,
> > > > every two or three hours. I saw several errors, such as:
> > > > * kernel watchdog soft lockups [md127_raid6:364]
> > > > * general protection faults (I have a few saved with the full exception stack)
> > > > * exceptions in iommu routines (again I have the full error with
> > > >   exception stack saved)
> > > > * full system lockup
> > >
> > > So in other words the server is very stable, unless asked to do full-speed
> > > reads from all disks at the same time.
> > >
> > > I'd suggest to check or improve cooling on the HBA cards, and then try a
> > > different PSU.
> >
> > Also the motherboard chipset cooling, since that's a lot of PCI-E traffic.
> > Maybe the CPU cooling as well, or at least check the CPU temperatures during
> > this load.
> >
> > And since you have full logs and backtraces, there's no point in waiting to
> > post those, just go ahead. Maybe they will point to something other than
> > suspect hardware, or at least to which part of hardware to suspect.
> >
> > --
> > With respect,
> > Roman
>
> Thanks for the suggestions. Hardware overheating might be my problem.
> I have several (loud) case fans blowing away.
> But the HBA cards and mobo southbridge are only passively cooled.
> Maybe I could mount fans on each card's heatsink. I'll investigate.
>
> The power supply is not an off the shelf job. So I don't conveniently
> have a replacement to try. I might have to bite the bullet and buy a
> second.
>
> I forgot to put in my original note that I ran memtest86 on this
> machine for four full cycles with no faults found. Also nothing is
> overclocked.
>
> Here are some of the errors I recorded. Maybe somebody can see a
> pattern in them...
>
> [snip]

Just to provide closure on the system instability I reported four
months ago.

At the end of 2021 I replaced the motherboard, CPU, and RAM with
"server grade" hardware (Intel Xeon). Since the upgrade, the system
has been 100% stable, even during sustained mdadm I/O workloads.

So it appears the consumer-grade motherboard was at fault all along.
Thanks for the help troubleshooting the issue.

_____________
Ryan Patterson
May the wings of liberty never lose a feather.