Re: Debugging system hangs

Roger Heflin <rogerheflin@xxxxxxxxx> · Wed, 15 Dec 2021 16:05:13 -0600

If a disk were timing out, the kernel would log various messages.
 grep -E 'ATA| sd |ata[0-9]' /var/log/messages
might get you the details.  It will also show when the disks were first
detected and reported at boot.
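
To see at a glance which port is complaining, the same search can be
boiled down to a count per ata port.  This is only a rough sketch: the
function name ata_summary is made up, the patterns are taken from the
sample messages further down rather than from any complete list, and
/var/log/messages is just the default path from the grep above.  On a
systemd machine without a syslog file, saving the output of
"journalctl -k -b -1" (previous boot, needs a persistent journal) to a
file and passing that file in should work the same way.

```shell
# Count libata error-handling lines per port in a saved kernel log.
# ata_summary is a hypothetical helper name, not an existing tool.
ata_summary() {
    # $1: log file to scan; defaults to the path used in the grep above
    awk '
        # Patterns taken from the sample messages, not an exhaustive list
        /failed command|hard resetting link|exception Emask|SError:/ {
            # Pull out the port name (ata4, ata5, ...) without the .00 device part
            if (match($0, /ata[0-9]+/)) {
                port = substr($0, RSTART, RLENGTH)
                count[port]++
            }
        }
        END { for (p in count) print p, count[p] }
    ' "${1:-/var/log/messages}" | sort
}
```

Running "ata_summary /var/log/messages" prints one line per port with
the number of matching lines.  A port that shows up repeatedly is the
one whose cable, drive or controller slot deserves a closer look.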

Timeouts look kind of like this:
ata5: SError: { Handshk }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/40:58:40:e8:88/00:00:e8:00:00/40 tag 11 ncq dma 32768
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/18:60:48:ea:88/00:00:e8:00:00/40 tag 12 ncq dma 12288
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:68:00:eb:88/00:00:e8:00:00/40 tag 13 ncq dma 4096
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:78:60:ea:88/00:00:e8:00:00/40 tag 15 ncq dma 4096
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:c8:f8:e5:88/02:00:e8:00:00/40 tag 25 ncq dma 266240
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/40:d0:00:e8:88/00:00:e8:00:00/40 tag 26 ncq dma 32768
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/c8:d8:80:e8:88/01:00:e8:00:00/40 tag 27 ncq dma 233472
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:f8:90:eb:88/00:00:e8:00:00/40 tag 31 ncq dma 4096
out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10
(ATA bus error)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
[4544065.390549] ata4.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen
[4544065.392582] ata4.00: irq_stat 0x08000000, interface fatal error
[4544065.394543] ata4: SError: { Handshk }
[4544065.396595] ata4.00: failed command: WRITE FPDMA QUEUED
[4544065.398523] ata4.00: cmd 61/40:70:98:2d:ea/00:00:85:00:00/40 tag 14 ncq dma 32768 out
[4544065.398523]          res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
[4544065.402441] ata4.00: status: { DRDY }
[4544065.404753] ata4.00: failed command: WRITE FPDMA QUEUED
[4544065.406946] ata4.00: cmd 61/40:78:18:2e:ea/00:00:85:00:00/40 tag 15 ncq dma 32768 out
[4544065.406946]          res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
[4544065.410850] ata4.00: status: { DRDY }
[4544065.412787] ata4: hard resetting link
[4544065.877609] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[4544065.880880] ata4.00: configured for UDMA/133
[4544065.882816] ata4: EH complete
ata4.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen
ata4.00: irq_stat 0x08000000, interface fatal error
ata4: SError: { Handshk }
ata4.00: failed command: WRITE FPDMA QUEUED
ata4.00: cmd 61/40:70:98:2d:ea/00:00:85:00:00/40 tag 14 ncq dma 32768
out#012         res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10
(ATA bus error)
ata4.00: status: { DRDY }
ata4.00: failed command: WRITE FPDMA QUEUED
ata4.00: cmd 61/40:78:18:2e:ea/00:00:85:00:00/40 tag 15 ncq dma 32768
out#012         res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10
(ATA bus error)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
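
As a side note on reading the "cmd" lines above: for these WRITE FPDMA
QUEUED (0x61) commands, the second hex byte together with the first
byte after the second slash appears to carry the sector count, and the
third byte carries the NCQ tag shifted left by 3.  That layout is my
reading of the samples, where it agrees with the printed tag and "dma"
byte count on every line.  The sketch below assumes 512-byte logical
sectors, and ncq_decode is a made-up helper name.

```shell
# Decode tag and transfer size from the hex field of a libata "cmd" line.
# Assumption: field order as inferred from the samples above; 512-byte sectors.
ncq_decode() (
    # Runs in a subshell so the IFS change does not leak to the caller.
    IFS='/:'
    # Split e.g. 61/40:58:40:e8:88/00:00:e8:00:00/40 into $1..$12
    set -- $1
    # $2 = low byte of sector count, $7 = high byte, $3 = tag << 3
    sectors=$(( (0x$7 << 8) | 0x$2 ))
    tag=$(( 0x$3 >> 3 ))
    echo "tag=$tag sectors=$sectors bytes=$(( sectors * 512 ))"
)
```

For the first failed command on ata5,
"ncq_decode 61/40:58:40:e8:88/00:00:e8:00:00/40" gives tag 11 and
32768 bytes, matching "tag 11 ncq dma 32768" in the log.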

The kdump autoreboot only happens after the kernel has already 'crashed',
when the machine would otherwise have been unresponsive anyway.
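
Before deliberately crashing the box with the sysrq trigger, it may be
worth checking that a crash kernel is actually loaded; otherwise the
test just panics the machine without leaving a dump.  A minimal sketch,
assuming the kernel exposes /sys/kernel/kexec_crash_loaded (it should
on kernels built with kexec crash dump support, as far as I know).  The
function name kdump_ready is made up for illustration.

```shell
# Succeed only if a kdump crash kernel is currently loaded.
# kdump_ready is a hypothetical helper name.
kdump_ready() {
    # $1: state file to read; defaults to the sysfs location (assumed present)
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    # The file contains 1 when a crash kernel is loaded, 0 otherwise
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}
```

Something like "kdump_ready && echo loaded || echo 'not loaded'" can be
run first, and the sysrq test only attempted once it reports loaded.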

On Wed, Dec 15, 2021 at 3:53 PM Wol <antlists@xxxxxxxxxxxxxxx> wrote:
>
> On 15/12/2021 16:45, Roger Heflin wrote:
> > If you cannot log in to the machine via ssh, also try pinging it.  If
> > ping works but ssh does not, either ssh died or the machine is paging
> > so heavily that user space cannot respond in a reasonable time.
>
> "Unable to resolve host name 'thewolery'"
>
> Paging is EXTREMELY unlikely with 32GB ram ... :-)
> >
> > If the disk were an issue there should be messages about something in
> > the disk layer timing out, but it sounds like there aren't any of
> > those sorts of messages.  If it was a controller hardware/pci slot/hw
> > issue that will in some cases cause an immediate power cycle and boot
> > back up.
>
> Where do I look for those after a reboot? The system basically is
> completely unresponsive - so no it's not a reset or anything, the system
> just stops...
> >
> > You might also configure kdump; there should be docs somewhere on
> > configuring it for your distribution.  Once it is configured, test it
> > with "echo c > /proc/sysrq-trigger", which should crash the machine
> > and leave you with a kernel core dump + dmesg from the time of the
> > crash.  Also, if kdump is configured and working, it will crash/dump
> > memory and typically boot back up automatically.
>
> I'll have to try it, although an autoreboot might not be a particularly
> good idea ...
> >
> > On Wed, Dec 15, 2021 at 3:54 AM Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Don't know if this is off-topic or not, seeing as my system is very much
> >> reliant on raid ...
> >>
> >> But basically I'm seeing the system just stop responding. Typically it's
> >> in screensaver mode, I've got a blank screen, and it won't wake up. (I
> >> used to think it was something to do with Thunderbird, it mostly
> >> happened while TB was hammering the system, but no ...)
> >>
> >> Today, I had it happen while the system was idle but not in screensaver,
> >> I run xosview, and everything was clearly frozen - including xosview.
> >>
> >> As you might know, my stack is ext4 over lvm (over raid over
> >> dm-integrity for /home) over spinning rust.
> >>
> >> And I run gentoo/systemd - currently on the latest stable kernel afaik,
> >> 5.10.76-gentoo-r1 SMP x86_64.
> >>
> >> Any advice on how to debug a hang - basically I need something that'll
> >> just sit there so when it crashes (and I press the reset button to
> >> recover) I'll have some sort of trace. It would be nice to prove it's
> >> not the disk stack at fault ...
> >>
> >> Obviously, "set these options in the kernel" won't faze me ...
> >>
> >> Cheers,
> >> Wol