George Avrunin writes:
[..]
I don't have any idea what's going on and it's very inconvenient (not to mention strongly discouraged by the powers that be) to have to keep going on campus to restart the machine. So I'd be very grateful for suggestions about how to figure this out, or at least stop it from happening again.
The capsule summary here is that the system appears to lock up under high I/O, either disk or network. A dnf upgrade puts a heavy load on both: network I/O while it goes out and downloads the updates from the repos, and disk I/O while it installs them. If everything's already downloaded, it's mostly just disk I/O.
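You can test that theory by simulating some load yourself. Something like
    dd if=/dev/urandom of=/tmp/junk$$ bs=1M count=100 &
Kick this off a dozen times or so, to write a gig's worth of junk into /tmp (presuming there's space for it). Note that $$ is the shell's PID, so repeats from the same shell all target the same file; vary the file name on each run, or start each one from a separate shell.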
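Here's a rough sketch of the "kick this off a dozen times" step as a loop. The function name gen_disk_load and its arguments are made up for illustration. One caveat: on a default Fedora install /tmp is usually a tmpfs, so writing there mostly exercises RAM rather than the disk. The example therefore assumes /var/tmp sits on the drive you want to stress; adjust to taste.

```shell
# gen_disk_load DIR [COUNT] [MB_EACH]   (hypothetical helper)
# Starts COUNT parallel dd writers, each writing MB_EACH MiB of random
# data to its own file under DIR, then waits for all of them.
gen_disk_load() {
    dir=$1
    count=${2:-12}
    mb=${3:-100}
    i=1
    while [ "$i" -le "$count" ]; do
        # Distinct suffix per writer: "junk$$" alone would be the same
        # name for every dd started from one shell.
        dd if=/dev/urandom of="$dir/junk$$.$i" bs=1M count="$mb" 2>/dev/null &
        i=$((i + 1))
    done
    wait    # block until every background dd has finished
    sync    # ask the kernel to flush cached writes to the device
}

# Example, assuming /var/tmp is on the disk under suspicion:
# gen_disk_load /var/tmp 12 100
```

The sync at the end matters: without it a good part of the data may still be sitting in the page cache when dd exits, and the disk never sees the full load. Remember to delete the junk files afterwards.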
If this locks up the machine, there you go. If not, and you think your dnf upgrade was downloading stuff, try generating some network load. You'll need some bandwidth available at the other end. You can take the dozen files of junk, put them in /var/www/html (presuming that Apache is running), and wget them all, in parallel, off this machine from some other place.
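The parallel wget part could look something like this, run on the other machine. It's only a sketch: the file names junk.1 through junk.N and the host name in the example are placeholders, so rename to match whatever you actually dropped into /var/www/html.

```shell
# junk_urls BASE_URL [COUNT]   (hypothetical helper)
# Prints one URL per line: BASE_URL/junk.1 ... BASE_URL/junk.COUNT
junk_urls() {
    base=$1
    count=${2:-12}
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s/junk.%s\n' "$base" "$i"
        i=$((i + 1))
    done
}

# fetch_parallel BASE_URL [COUNT]   (hypothetical helper)
# Downloads all of the files at once and throws the data away, so the
# load lands on the serving machine's network (and disk reads) only.
fetch_parallel() {
    for url in $(junk_urls "$1" "${2:-12}"); do
        wget -q -O /dev/null "$url" &
    done
    wait    # return once every download has finished or failed
}

# Example (host name is a placeholder):
# fetch_parallel http://suspect-box.example 12
```

Writing to /dev/null keeps the client's own disk out of the picture, so if the server locks up you know which side to blame.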
For extra credit you can try generating both disk and network load.
If this turns out to reliably lock up this particular bit of hardware, there you go. What can you do about it? Very little. It's going to be either failing hardware (hard drive, power supply, or RAM), or a kernel bug.
Looking up the spec sheet for your box, it looks like both spinning rust and SSDs are available options. You didn't say which one you have, but if your hard drives are spinning rust, that's the most likely point of failure. Pretty much the only easily accessible clue would be SMART diagnostics on the hard drive(s). See if there's anything there that tells you that the hard drive is on its last breath.
The next most accessible clue is only available if you're physically at the machine: a RAM tester. Do Fedora live images still include a memtest option, does anyone know?
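For the SMART check, something along these lines should pull the interesting bits. This is a sketch, not gospel: smart_report is a made-up wrapper name, /dev/sda in the example is a guess at your device name, and it needs to run as root. smartctl comes from the smartmontools package on Fedora.

```shell
# smart_report DEVICE   (hypothetical helper; run as root)
smart_report() {
    dev=$1
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found; install the smartmontools package" >&2
        return 127
    fi
    smartctl -H "$dev"          # the drive's overall health self-assessment
    smartctl -A "$dev"          # vendor attributes: reallocated/pending sectors etc.
    smartctl -l error "$dev"    # errors the drive itself has logged
}

# Example (device name is an assumption; check lsblk for yours):
# smart_report /dev/sda
```

A "PASSED" from -H doesn't prove much by itself. Non-zero reallocated or pending sector counts in the -A output, or entries in the error log, are the more telling signs on spinning rust.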
You could be hitting a kernel bug. In the old days, I would rig up a crossover cable on my PC's serial port, configure the kernel with a serial console, and capture kernel oopses on the other machine over the serial port. RS-232 ports are long gone. I have some vague recollection of serial over USB being an option. Another option worth exploring would be to look into remote syslogging. Maybe the kernel can eke out an extra packet or two to a remote syslog before crashing.
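If you go the remote syslog route, one way to set it up is an rsyslog forwarding rule for kernel messages. The sketch below assumes rsyslog is installed and running on the flaky machine (it may not be on a stock Fedora install), and that some other box is already listening for syslog over UDP. The helper name, the drop-in file name, and the address are all placeholders.

```shell
# write_forward_rule CONF_FILE LOG_HOST [PORT]   (hypothetical helper)
# Writes a one-line rsyslog rule that forwards kernel messages to LOG_HOST.
write_forward_rule() {
    conf=$1
    host=$2
    port=${3:-514}
    # A single "@" means UDP: no connection to keep alive, so a dying
    # machine has the best chance of getting a last message out.
    # "@@" would select TCP instead.
    printf 'kern.* @%s:%s\n' "$host" "$port" > "$conf"
}

# Example, as root (file name and address are placeholders):
# write_forward_rule /etc/rsyslog.d/90-remote-kern.conf 192.0.2.10
# systemctl restart rsyslog
```

No guarantees, of course. rsyslog runs in userspace, so on a hard lockup it may never get scheduled to send anything. But it costs little to set up and you might catch the warning signs that precede the hang.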
But at least confirming that you can reliably reproduce a lockup by simulating high disk or network I/O is better than nothing.