There would be various messages.

    grep -E 'ATA| sd |ata[0-9]' /var/log/messages

might get you details.  It will also show when the disks are first
showing up and being reported.

Timeouts look kind of like this:

ata5: SError: { Handshk }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/40:58:40:e8:88/00:00:e8:00:00/40 tag 11 ncq dma 32768 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/18:60:48:ea:88/00:00:e8:00:00/40 tag 12 ncq dma 12288 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:68:00:eb:88/00:00:e8:00:00/40 tag 13 ncq dma 4096 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:78:60:ea:88/00:00:e8:00:00/40 tag 15 ncq dma 4096 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:c8:f8:e5:88/02:00:e8:00:00/40 tag 25 ncq dma 266240 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/40:d0:00:e8:88/00:00:e8:00:00/40 tag 26 ncq dma 32768 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/c8:d8:80:e8:88/01:00:e8:00:00/40 tag 27 ncq dma 233472 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:f8:90:eb:88/00:00:e8:00:00/40 tag 31 ncq dma 4096 out#012 res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete

[4544065.390549] ata4.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen
[4544065.392582] ata4.00: irq_stat 0x08000000, interface fatal error
[4544065.394543] ata4: SError: { Handshk }
[4544065.396595] ata4.00: failed command: WRITE FPDMA QUEUED
[4544065.398523] ata4.00: cmd 61/40:70:98:2d:ea/00:00:85:00:00/40 tag 14 ncq dma 32768 out
[4544065.398523]          res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
[4544065.402441] ata4.00: status: { DRDY }
[4544065.404753] ata4.00: failed command: WRITE FPDMA QUEUED
[4544065.406946] ata4.00: cmd 61/40:78:18:2e:ea/00:00:85:00:00/40 tag 15 ncq dma 32768 out
[4544065.406946]          res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
[4544065.410850] ata4.00: status: { DRDY }
[4544065.412787] ata4: hard resetting link
[4544065.877609] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[4544065.880880] ata4.00: configured for UDMA/133
[4544065.882816] ata4: EH complete

ata4.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen
ata4.00: irq_stat 0x08000000, interface fatal error
ata4: SError: { Handshk }
ata4.00: failed command: WRITE FPDMA QUEUED
ata4.00: cmd 61/40:70:98:2d:ea/00:00:85:00:00/40 tag 14 ncq dma 32768 out#012 res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
ata4.00: status: { DRDY }
ata4.00: failed command: WRITE FPDMA QUEUED
ata4.00: cmd 61/40:78:18:2e:ea/00:00:85:00:00/40 tag 15 ncq dma 32768 out#012 res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete

The autoreboot only happens after the machine has already 'crashed'
and would have been otherwise unresponsive anyway.

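If the raw grep output is too noisy, something along these lines might
help boil it down to a count per port.  This is only a rough sketch:
ata_error_summary is a made-up helper name, the three event strings are
just the ones visible in the excerpts above, and /var/log/messages is
assumed as the default log location (it may well be elsewhere, or only
in the journal, on a systemd box).

```shell
#!/bin/sh
# Sketch only: count libata error events per port in a syslog-style file.
# ata_error_summary is a hypothetical helper, not an existing tool.
ata_error_summary() {
    logfile="${1:-/var/log/messages}"   # assumed default path
    awk '
        # Lines look like "ata4.00: failed command: ..." or
        # "ata4: hard resetting link", with or without a timestamp prefix.
        match($0, /ata[0-9]+(\.[0-9]+)?: (failed command|exception Emask|hard resetting link)/) {
            hit = substr($0, RSTART, RLENGTH)
            port = hit;  sub(/[.:].*/, "", port)       # "ata4.00: ..." -> "ata4"
            event = hit; sub(/^[^:]*: /, "", event)    # keep only the event text
            count[port " " event]++
        }
        END { for (k in count) print count[k], k }
    ' "$logfile" | sort -k2
}
```

Run as "ata_error_summary /var/log/messages" (or against a saved copy
of dmesg output).  Each output line is "<count> <port> <event>", so a
port that keeps resetting its link should stand out quickly.  No output
at all would suggest the disk layer was not logging these errors.
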
On Wed, Dec 15, 2021 at 3:53 PM Wol <antlists@xxxxxxxxxxxxxxx> wrote:
>
> On 15/12/2021 16:45, Roger Heflin wrote:
> > If you cannot login to the machine via ssh, also try pinging it.  If
> > ping works but ssh does not either ssh died, or the machine is paging
> > so heavily that user space cannot respond in a reasonable time.
>
> "Unable to resolve host name 'thewolery'"
>
> Paging is EXTREMELY unlikely with 32GB ram ... :-)
> >
> > If the disk were an issue there should be messages about something in
> > the disk layer timing out, but it sounds like there aren't any of
> > those sorts of messages.  If it was a controller hardware/pci slot/hw
> > issue that will in some cases cause an immediate power cycle and boot
> > back up.
>
> Where do I look for those after a reboot? The system basically is
> completely unresponsive - so no it's not a reset or anything, the system
> just stops...
> >
> > You might also configure kdump, there should be doc's someplace on
> > configuring it for your distribution, once configured then test it
> > with "echo c > /proc/sysrq-trigger" and that should crash the machine
> > and leave you with a kernel core dump + dmesg from the time of the
> > crash.  Also if kdump is configured and working it will crash/dump
> > memory and typically boot back up automatically.
>
> I'll have to try it, although an autoreboot might not be a particularly
> good idea ...
> >
> > On Wed, Dec 15, 2021 at 3:54 AM Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Don't know if this is off-topic or not, seeing as my system is very much
> >> reliant on raid ...
> >>
> >> But basically I'm seeing the system just stop responding. Typically it's
> >> in screensaver mode, I've got a blank screen, and it won't wake up. (I
> >> used to think it was something to do with Thunderbird, it mostly
> >> happened while TB was hammering the system, but no ...)
> >>
> >> Today, I had it happen while the system was idle but not in screensaver,
> >> I run xosview, and everything was clearly frozen - including xosview.
> >>
> >> As you might know, my stack is ext4 over lvm (over raid over
> >> dm-integrity for /home) over spinning rust.
> >>
> >> And I run gentoo/systemd - currently on the latest stable kernel afaik,
> >> 5.10.76-gentoo-r1 SMP x86_64.
> >>
> >> Any advice on how to debug a hang - basically I need something that'll
> >> just sit there so when it crashes (and I press the reset button to
> >> recover) I'll have some sort of trace. It would be nice to prove it's
> >> not the disk stack at fault ...
> >>
> >> Obviously, "set these options in the kernel" won't faze me ...
> >>
> >> Cheers,
> >> Wol
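
On the kdump suggestion quoted above: before doing the deliberate
"echo c > /proc/sysrq-trigger" test, it may be worth checking that a
crash kernel is actually loaded, otherwise the test just hangs the box.
A minimal sketch, assuming the usual /proc/cmdline and
/sys/kernel/kexec_crash_loaded locations; kdump_ready is a made-up
helper name and the paths are parameters so it can be tried on copies.

```shell
#!/bin/sh
# Sketch only: sanity-check kdump before triggering a test crash.
# kdump_ready is a hypothetical helper, not part of any kdump package.
kdump_ready() {
    cmdline="${1:-/proc/cmdline}"
    loaded="${2:-/sys/kernel/kexec_crash_loaded}"
    # Memory for the capture kernel is reserved at boot via crashkernel=...
    if ! grep -q 'crashkernel=' "$cmdline"; then
        echo "no crashkernel= on the kernel command line"
        return 1
    fi
    # The kernel reports 1 here once a crash kernel has been loaded with kexec.
    if [ "$(cat "$loaded" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "kdump looks ready"
}
```

Calling "kdump_ready" with no arguments checks the running system.  It
only says the pieces are in place; the sysrq test is still the real
proof that a dump gets written.
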