On Mon, 6 Sep 2021 20:44:31 -0400
Ryan Patterson <ryan.goat@xxxxxxxxx> wrote:

> My file server is usually very stable. The past week I had two mdadm
> arrays that required resync operations.
> * newly created raid6 array (14 x 16TB seagate exos)
> * existing raid 6 array, after a reboot resync on hot spare (14 x 4TB
> seagate barracuda)
>
> During both resync operations (they ran one at a time) the system
> would routinely experience a major error and require a hard reboot,
> every two or three hours. I saw several errors, such as:
> * kernel watchdog soft lockups [md127_raid6:364]
> * general protection faults (I have a few saved with the full exception stack)
> * exceptions in iommu routines (again I have the full error with
> exception stack saved)
> * full system lockup

So in other words, the server is very stable unless it is asked to do
full-speed reads from all disks at the same time.

I'd suggest checking or improving the cooling on the HBA cards, and then
trying a different PSU.

> I doubt there is a bug in mdadm that caused this behavior. But it was
> very predictable and repeatable while the resync operations were in
> progress.
>
> How can I avoid these errors the next time I have an array in need of a resync?
>
> OS: debian 11 bullseye
> kernel: 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03)
> mdadm: v4.1 - 2018-10-01
> sata HBA: 3 x LSI SAS 9201-16i
> _____________
> Ryan Patterson
> May the wings of liberty never lose a feather.

--
With respect,
Roman
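
One way to test the full-load hypothesis before swapping hardware might be to cap the resync rate of the affected array and see whether the lockups stop. The following is only a minimal sketch, not something taken from the thread: it assumes the standard md sysfs layout (`/sys/block/<array>/md/sync_speed_max`, value in KiB/s), the helper names are hypothetical, and the array name `md127` is borrowed from the watchdog message quoted above.

```python
from pathlib import Path

# Standard sysfs location for block devices (assumption: sysfs is mounted here).
SYSFS_ROOT = Path("/sys/block")


def sync_speed_max_path(array: str, root: Path = SYSFS_ROOT) -> Path:
    """Return the per-array resync speed cap file, e.g. .../md127/md/sync_speed_max."""
    return root / array / "md" / "sync_speed_max"


def set_sync_speed_max(array: str, kib_per_sec: int, root: Path = SYSFS_ROOT) -> None:
    """Write a resync speed cap (KiB/s) for one md array. Needs root on a real system."""
    if kib_per_sec <= 0:
        raise ValueError("speed cap must be a positive number of KiB/s")
    sync_speed_max_path(array, root).write_text(f"{kib_per_sec}\n")
```

Run as root, a call such as `set_sync_speed_max("md127", 50_000)` would cap that array at roughly 50 MB/s; the figure is an arbitrary illustration, not a recommendation. If the system stays up at a reduced rate, that would be consistent with a load-related hardware problem (heat or power) rather than a software bug, but it would slow the resync and would not fix the underlying cause.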