Re: Server goes catatonic after a few days

Terry Barnaby <terry1@xxxxxxxxxxx> · Tue, 8 Jan 2019 18:40:57 +0000

    You could try memtester, a user-level memory tester: run
      "dnf install memtester" and then "memtester 1g".
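    A fixed "1g" only covers part of the RAM on a larger machine. As a
      rough sketch (not something from my own setup), the Python below
      picks a size from the MemAvailable line of /proc/meminfo and builds
      the memtester command line. The function name memtester_command and
      the 80% default are assumptions for illustration; the only memtester
      syntax relied on is "memtester <size>M <loops>".

```python
import re


def memtester_command(meminfo_text, percent=80, loops=1):
    """Build a memtester argv list sized from /proc/meminfo contents.

    percent is a hypothetical safety margin: test only that share of
    MemAvailable so the running services are not starved.
    """
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable line not found in meminfo text")
    available_kb = int(match.group(1))
    # Integer arithmetic only, to avoid float rounding surprises.
    test_mb = available_kb * percent // 100 // 1024
    if test_mb < 1:
        raise ValueError("not enough available memory to test")
    return ["memtester", f"{test_mb}M", str(loops)]
```

    To use it, read /proc/meminfo into a string, pass it in, and run the
      resulting command as root (for example via subprocess.run), since
      memtester normally needs root to lock the memory it tests.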
    I also have a server that occasionally dies. It started doing
      this late last year under Fedora 27. I wasn't sure whether the
      cause was a particular kernel change or the hardware, but I
      replaced the motherboard, memory and CPU at that time as the
      machine was about 5 years old. However, it has still crashed
      occasionally with the new hardware and a fresh install of
      Fedora 29.
    The last /var/log/messages entries before the latest crash, followed
      by the start of the next boot, were:

      Jan 8 10:42:52 king mosquitto[1435]: 1546944172: New connection from 192.168.202.30 on port 1883.
      Jan 8 10:42:52 king mosquitto[1435]: 1546944172: New client connected from 192.168.202.30 as DVES_00B2F8 (c1, k10, u'DVES_USER').
      Jan 8 10:43:13 king mosquitto[1435]: 1546944193: Client DVES_00B2F8 has exceeded timeout, disconnecting.
      Jan 8 10:43:13 king mosquitto[1435]: 1546944193: Socket error on client DVES_00B2F8, disconnecting.
      #########################################################################################################################################################################################################Jan 8 18:03:39 king kernel: microcode: microcode updated early to revision 0xc6, date = 2018-04-17
      Jan 8 18:03:39 king kernel: Linux version 4.19.9-300.fc29.x86_64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 8.2.1 20181105 (Red Hat 8.2.1-5) (GCC)) #1 SMP Thu Dec 13 17:25:01 UTC 2018
      Jan 8 18:03:39 king kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.9-300.fc29.x86_64 root=UUID=5d3007f8-fa92-4fe6-98a8-e812b680198f ro rd.auto LANG=en_GB.UTF-8
      Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
      Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
      Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
      Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
      Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'

    The "#" characters were in fact NUL (0x00) bytes in the file, which
      is strange.
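
    If anyone wants to check their own logs for the same thing, here is
      a small Python sketch that reports runs of NUL bytes in a blob of
      log data. The function name find_nul_runs and the 16-byte minimum
      run length are arbitrary choices of mine, not anything standard.

```python
import re


def find_nul_runs(data, min_len=16):
    """Return (offset, length) for each run of at least min_len NUL bytes.

    data must be bytes, e.g. the result of reading the log file in
    binary mode.
    """
    # b"\\x00{%d,}" becomes a regex such as \x00{16,}: 16 or more NULs.
    pattern = re.compile(b"\\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]
```

    For example, read /var/log/messages with open(path, "rb").read() and
      pass the result in; each returned offset marks where a block of
      NULs starts, which should line up with the point where the machine
      stopped logging.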

    I do wonder if
      there is an obscure kernel bug somewhere. This server has an
      Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz and is doing DVB recording
      amongst other work. My other servers seem fine, though.
    Terry

    On 06/01/2019 22:15, Alex wrote:

      Hi,
I have a fedora29 system in our colo that's a few years old now and
just goes catatonic and stops responding after a few days. It's
happened a few times now, even with different kernels, so I suspect
it's a memory or hardware problem.

Is it possible to run memtest without having physical access to the
machine to insert a USB stick or CDROM?

After the machine reboots (via IPMI access), there's nothing in the
logs and no abrt-cli info on a kernel crash or other info I can find
about why it died.

What else can I do to troubleshoot this without having to drive to the
colo to check on it?

The last entry from journalctl just before it stopped responding was
just a regular nrpe entry, unrelated to the crash.

I've pasted the current dmesg output here:
http://pasted.co/4b700ee1

Any ideas greatly appreciated.
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
