Re: Debugging system hangs

On 15/12/2021 16:45, Roger Heflin wrote:
If you cannot log in to the machine via ssh, also try pinging it.  If
ping works but ssh does not, either sshd died or the machine is paging
so heavily that user space cannot respond in a reasonable time.
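
A minimal sketch of that ping-versus-ssh check, run from a second machine. The function names (classify_host, probe_host) are made up for illustration, and the options assume iputils ping and OpenSSH:

```shell
#!/bin/sh
# classify_host PING_RC SSH_RC
# Turn the exit codes of the two probes into a rough diagnosis.
classify_host() {
    ping_rc=$1
    ssh_rc=$2
    if [ "$ping_rc" -ne 0 ]; then
        echo "no ping reply: kernel/network stack down, or a network problem"
    elif [ "$ssh_rc" -ne 0 ]; then
        echo "ping ok, ssh failed: sshd dead or user space starved"
    else
        echo "ping ok, ssh ok: host is responsive"
    fi
}

# probe_host HOST
# Run both probes against HOST (a name or an IP address) and classify.
probe_host() {
    host=$1
    # -c 3: send three echo requests; -W 2: wait up to 2 s for each reply
    ping -c 3 -W 2 "$host" >/dev/null 2>&1
    ping_rc=$?
    # BatchMode avoids hanging on a password prompt
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1
    ssh_rc=$?
    classify_host "$ping_rc" "$ssh_rc"
}
```

For example, "probe_host thewolery", or the machine's IP address if the name does not resolve. A ping reply only shows that the kernel is still answering on the network; it says nothing about user space.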

"Unable to resolve host name 'thewolery'"

Paging is EXTREMELY unlikely with 32GB ram ... :-)

If the disk were an issue there should be messages about something in
the disk layer timing out, but it sounds like there aren't any of
those sorts of messages.  If it were a controller, PCI slot or other
hardware issue, that will in some cases cause an immediate power cycle
and boot back up.

Where do I look for those after a reboot? The system is basically completely unresponsive, so no, it's not a reset or anything; the system just stops...
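
One possible place to look on a systemd machine, sketched under assumptions: the journal must be persistent (for example /var/log/journal exists), and the patterns below are only a guess at typical libata, block-layer and hung-task messages, not a complete list. The function name scan_disk_errors is made up for illustration:

```shell
#!/bin/sh
# scan_disk_errors
# Read kernel log lines on stdin and print only those that look like
# disk-layer trouble. The pattern list is illustrative, not exhaustive.
scan_disk_errors() {
    grep -E -i 'blocked for more than [0-9]+ seconds|ata[0-9]+(\.[0-9]+)?: (exception|failed command)|I/O error|timed out|timeout'
}

# Usage, after rebooting from a hang:
#   journalctl -k -b -1 --no-pager | scan_disk_errors
# -k selects kernel messages, -b -1 selects the previous boot.
```

If the disk stack itself is what stalled, the last messages may never have reached the disk, so an empty result here does not prove the disks are innocent.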

You might also configure kdump; there should be docs somewhere on
configuring it for your distribution.  Once configured, test it
with "echo c > /proc/sysrq-trigger", which should crash the machine
and leave you with a kernel core dump plus the dmesg from the time of
the crash.   Also, if kdump is configured and working, it will crash,
dump memory, and typically boot back up automatically.
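
Before triggering the test crash it may be worth checking that a crash kernel is actually reserved and loaded. A rough sketch; the function name kdump_ready and its messages are made up for illustration, and the two default paths are the usual kernel interfaces:

```shell
#!/bin/sh
# kdump_ready [CMDLINE_FILE] [LOADED_FILE]
# Succeeds only if memory was reserved with crashkernel= at boot and a
# crash kernel is currently loaded. The arguments exist only so the
# check can be pointed at other files; the defaults are the real ones.
kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= reservation on the kernel command line"
        return 1
    fi
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "kdump looks ready"
}

# Usage, as root. WARNING: the second command crashes the machine on
# purpose, so sync and close everything first:
#   kdump_ready && echo c > /proc/sysrq-trigger
```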

I'll have to try it, although an autoreboot might not be a particularly good idea ...

On Wed, Dec 15, 2021 at 3:54 AM Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:

Don't know if this is off-topic or not, seeing as my system is very much
reliant on raid ...

But basically I'm seeing the system just stop responding. Typically it's
in screensaver mode, I've got a blank screen, and it won't wake up. (I
used to think it was something to do with Thunderbird, it mostly
happened while TB was hammering the system, but no ...)

Today, I had it happen while the system was idle but not in screensaver.
I run xosview, and everything was clearly frozen - including xosview.

As you might know, my stack is ext4 over lvm (over raid over
dm-integrity for /home) over spinning rust.

And I run gentoo/systemd - currently on the latest stable kernel afaik,
5.10.76-gentoo-r1 SMP x86_64.

Any advice on how to debug a hang - basically I need something that'll
just sit there so when it crashes (and I press the reset button to
recover) I'll have some sort of trace. It would be nice to prove it's
not the disk stack at fault ...
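
A very crude "something that'll just sit there", sketched under assumptions: a heartbeat loop that appends a timestamp and the 1-minute load average to a file and flushes it, so that after a reset the last line shows roughly when things stopped. The function names and the log path are made up for illustration:

```shell
#!/bin/sh
# heartbeat_once FILE
# Append "<epoch seconds> load=<1-min loadavg>" to FILE and flush it.
heartbeat_once() {
    file=$1
    load=unknown
    [ -r /proc/loadavg ] && read -r load rest < /proc/loadavg
    printf '%s load=%s\n' "$(date +%s)" "$load" >> "$file"
    # Flush just this file where sync accepts a file argument,
    # otherwise fall back to a global sync.
    sync "$file" 2>/dev/null || sync
}

# heartbeat_loop FILE [INTERVAL_SECONDS]
# Write a heartbeat forever; leave it running in the background.
heartbeat_loop() {
    while :; do
        heartbeat_once "$1"
        sleep "${2:-10}"
    done
}

# last_heartbeat FILE
# Print the last heartbeat written before the hang.
last_heartbeat() {
    tail -n 1 "$1"
}

# Usage (hypothetical path):
#   heartbeat_loop /var/log/heartbeat.log 10 &
# and after the reset:
#   last_heartbeat /var/log/heartbeat.log
```

This only gives the time of the freeze, not the cause, and it writes through the same disk stack that is under suspicion. A gap between the last heartbeat and the moment the screen froze would itself be a hint that writes stalled first.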

Obviously, "set these options in the kernel" won't faze me ...

Cheers,
Wol


