Re: Maybe offline updates aren't a bad idea

Roger Heflin <rogerheflin@xxxxxxxxx> · Sat, 31 Jul 2021 10:46:12 -0500

Adding to what Chris suggests.

When ssh fails, always ping the IP address.  If ping responds, the
kernel is up in some state (even during heavy paging or a deadlock,
ping generally still responds as long as the kernel is running and has
not crashed).  If ping does not respond, either the network has died
or the kernel has crashed.  The network rarely stops responding unless
someone takes it down by mistake, though I have seen at least one
network card whose crashes dropped the network many times; that was
easy to diagnose because it logged the issue.
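
As a rough sketch of the check above, assuming a Linux box with the
iputils ping command (the -c and -W options) and sshd listening on
port 22; the function names are made up for illustration, and reading
the ssh banner is only a stand-in for a real login attempt:

```python
import socket
import subprocess


def ping_ok(host, timeout_s=2):
    """Return True if one ICMP echo request gets a reply (iputils ping assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ssh_banner_ok(host, port=22, timeout_s=5.0):
    """Return True if sshd sends its "SSH-" banner before the timeout.

    The kernel can accept a TCP connection on its own, so the banner is
    what shows that the sshd process in userspace is still making progress.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            return sock.recv(64).startswith(b"SSH-")
    except OSError:  # covers refused connections, unreachable hosts and timeouts
        return False


def triage(ping_reply, ssh_reply):
    """Map the two probe results to the cases described in the text."""
    if ssh_reply:
        return "userspace up: sshd answered"
    if ping_reply:
        return "kernel up, userspace stuck or ssh blocked"
    return "network down or kernel crashed"
```

You would call it as triage(ping_ok(host), ssh_banner_ok(host)) with
host set to the IP address of the machine.  A firewall that drops ICMP
makes the ping result meaningless, so treat the output as a hint only.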

Enabling crash dumps might be a good idea.  If no crash dump is
collected, or collection is not even attempted, and the node simply
boots back up, that is often a sign of a hardware fault that forced an
immediate reset of the hardware.
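
A minimal sketch for checking this after the node comes back,
assuming a kexec-based kdump setup: the kernel reports whether a crash
kernel is loaded in /sys/kernel/kexec_crash_loaded, and /var/crash is
assumed to be the dump directory (adjust it if your kdump
configuration writes elsewhere).  The function names are hypothetical.

```python
from pathlib import Path


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the kernel reports a loaded crash (kdump) kernel."""
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:  # file missing or unreadable: treat as not loaded
        return False


def list_crash_dumps(dump_dir="/var/crash"):
    """Return the sorted names of subdirectories in the dump directory.

    The /var/crash default is an assumption; each collected dump is
    expected to sit in its own subdirectory.
    """
    root = Path(dump_dir)
    try:
        return sorted(p.name for p in root.iterdir() if p.is_dir())
    except OSError:  # directory missing or unreadable
        return []
```

If crash_kernel_loaded() was True before the hang and
list_crash_dumps() shows nothing new after the reboot, that fits the
hardware-reset case described above.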

On Sat, Jul 31, 2021 at 1:01 AM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jul 30, 2021 at 2:00 PM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
> >
> > If it was just a plasma crash, then ssh and/or the alt keys would have
> > worked to switch terminals.
> >
> > Details said neither worked.  The kernel and/or a significant part of
> > userspace was deadlocked and/or crashed.
>
> I wonder if logs contain anything... i.e. from the boot following the
> failed update, use journalctl -b-1 and if it's 5 boots back use -b-5
>
> It might have the start of the problem anyway. I also suspect a
> deadlock. It can make it seem like ssh is dead but it's just super
> slow. Or may even time out unless a session has already started.
> Workstation edition and KDE spin have improved resource control, which
> is a work in progress (also on KDE you will need to install
> uresourced). This attempts to ensure minimum resources are available
> for the desktop to be responsive. One possible limitation is IO
> pressure; we're not quite there yet implementing IO isolation. A
> deadlock though is a different problem so the resource control work
> wouldn't help.
>
> If you ever see "task xxx:yyy blocked for more than 120 seconds" it's
> best to issue sysrq+w (i.e. echo w > /proc/sysrq-trigger) to dump
> extra debugging information into the kernel message buffer, and then
> file a bug attaching dmesg.
>
>
> --
> Chris Murphy
> _______________________________________________
> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure