You could try memtester, a user-level memory tester: install it with "dnf install memtester" and run, for example, "memtester 1g".

I also have a server that occasionally dies. It started doing this late last year under Fedora 27. I wasn't sure whether the cause was a particular kernel change or the hardware, but the motherboard, memory and CPU were about 5 years old, so I replaced them at that time. However, it has still crashed occasionally with the new hardware and a fresh install of Fedora 29. The latest /var/log/messages entries from when it crashed were:

Jan 8 10:42:52 king mosquitto[1435]: 1546944172: New connection from 192.168.202.30 on port 1883.
Jan 8 10:42:52 king mosquitto[1435]: 1546944172: New client connected from 192.168.202.30 as DVES_00B2F8 (c1, k10, u'DVES_USER').
Jan 8 10:43:13 king mosquitto[1435]: 1546944193: Client DVES_00B2F8 has exceeded timeout, disconnecting.
Jan 8 10:43:13 king mosquitto[1435]: 1546944193: Socket error on client DVES_00B2F8, disconnecting.
#########################################################################################################################################################################################################Jan 8 18:03:39 king kernel: microcode: microcode updated early to revision 0xc6, date = 2018-04-17
Jan 8 18:03:39 king kernel: Linux version 4.19.9-300.fc29.x86_64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 8.2.1 20181105 (Red Hat 8.2.1-5) (GCC)) #1 SMP Thu Dec 13 17:25:01 UTC 2018
Jan 8 18:03:39 king kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.9-300.fc29.x86_64 root=UUID=5d3007f8-fa92-4fe6-98a8-e812b680198f ro rd.auto LANG=en_GB.UTF-8
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'

The "#" were in fact <nul> (0x00) bytes which is strange.
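
To check whether other parts of a log contain the same kind of <nul> padding, something like the following could be used. It is only a minimal sketch: the function names (find_nul_runs, scan_log) and the min_len threshold are made up for illustration, and it reads the whole file into memory, which assumes the log is of modest size. Runs like this are often a sign that the machine stopped before pending writes reached the disk, but that is an assumption here, not something the log itself confirms.

```python
import re
from typing import List, Tuple

# Matches one or more consecutive 0x00 bytes.
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (offset, length) for every run of NUL bytes of at least min_len."""
    runs = []
    for match in NUL_RUN.finditer(data):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), length))
    return runs


def scan_log(path: str, min_len: int = 16) -> List[Tuple[int, int]]:
    """Read a log file in binary mode and report its NUL runs.

    min_len=16 is an arbitrary illustrative threshold, chosen only to
    skip isolated stray NUL bytes.
    """
    with open(path, "rb") as handle:
        return find_nul_runs(handle.read(), min_len)
```

Calling scan_log("/var/log/messages") as root would return a list of (byte offset, run length) pairs; an offset can then be compared with the timestamps of the surrounding entries to see whether every run lines up with a crash.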
I do wonder if there is an obscure kernel bug somewhere. This server has an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz and is doing DVB recording amongst other work. The other servers I have seem fine, though.

Terry
On 06/01/2019 22:15, Alex wrote:
Hi,

I have a fedora29 system in our colo that's a few years old now and just goes catatonic and stops responding after a few days. It's happened a few times now, even with different kernels, so I suspect it's a memory or hardware problem.

Is it possible to run memtest without having physical access to the machine to insert a USB stick or CDROM?

After the machine reboots (via IPMI access), there's nothing in the logs and no abrt-cli info on a kernel crash or other info I can find about why it died.

What else can I do to troubleshoot this without having to drive to the colo to check on it?

The last entry from journalctl just before it stopped responding was just a regular nrpe entry, unrelated to the crash.

I've pasted the current dmesg output here:
http://pasted.co/4b700ee1

Any ideas greatly appreciated.

_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx