Well, I did pull out the nvidia gpu. One reason I suspected it is that
I had seen many messages about kernel oopses in the nouveau driver.
This gpu is just for ML; the console is connected to the motherboard's
builtin VGA (and normally all use is remote anyway). The last time the
system became unusable was after I ran systemctl restart
display-manager, so that made me suspicious. If I need to use the GPU
I'll try installing the nvidia driver first and disabling nouveau.

The system seems to be working normally now, although I have not tried
rebooting.

Thanks for your suggestions!

On Fri, Feb 5, 2021 at 11:14 AM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
>
> if it was failing/weak power supply it would just crash, nothing slows
> down nicely when that happens.
>
> Nvidia GPU will usually crash the hardware if it overstresses the
> power supply and will also crash if it goes bad.
>
> Now overheating may cause the cpus to throttle and that may make the
> machine feel rather sluggish, though I would not expect minutes,
> unless there is normally a large cpu load on the machine.
>
> Install a package called perf, and next time see if you can run "perf
> top" that will show internally what calls the kernel processes may or
> may not be doing internally and how much time they are spending. Note
> that on machines with large counts of cores that the ondemand power
> savings settings that adjusts mhz is expensive to run. That will show
> as significant system time, and that will show in perf top. If you
> don't have it installed, install sar or a similar tool that will give
> you some ideas of what the system saw leading up to the issues, and
> during the issues. Usually I set sar at a 1minute sample vs the
> default 10min sample, that change is done via systemd, google knows
> how.
> The other items that will crush a machine and aren't obvious
> are applications creating processes at a high rate, and/or
> applications mapping and unmapping a lot of memory, that will also
> show as system time, and will have a footprint in perf top. So note
> when it is running good the ratio of user to system time (user being
> 5x system or higher is what is normal, if it drops to much below 5
> often indicates one of the above issues). sar will show
> cpu(user/system/...), disk response, and a lot of raw network and
> tcp/udp stats, and process created rates and memory allocation and
> paging rates.
>
>
>
> On Fri, Feb 5, 2021 at 6:09 AM Neal Becker <ndbecker2@xxxxxxxxx> wrote:
> >
> > I've been running F32 on a shiny new amd dual epyc workstation for
> > about 1 year. The system is now remote to me and not convenient to
> > access.
> >
> > About 1 week ago the system became unresponsive. I noticed errors
> > logged about I/O errors, so I guessed it was an issue with the SSD. I
> > went there and replaced the SSD with a shiny new samsung 1tb.
> > Reinstalled F33 and got my vpns going so I could access again from
> > home.
> >
> > But things are acting very strangely. Install was lightning fast.
> > But after a while the machine becomes unusable. Any command takes
> > minutes to react. I am unable to reboot it. sudo reboot after a very
> > long time does nothing.
> > I don't see anything interesting in /var/log/messages (I installed rsyslog).
> > When I can eventually get top to run, I see systemd is in D state.
> > There is plenty of free memory, and the machine has 64GB.
> >
> > I'm going to visit again and this time yank out the nvidia gpu. This
> > is just a wild guess based on 1) it isn't critical for use right now
> > 2) it places a load on the power supply just in case that's the issue
> > 3) it's the only thing I can think to try.
> >
> > Just wondering if anyone has any thoughts on how to troubleshoot this.
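
If neither perf nor sar is installed when the slowdown hits, the
user-to-system ratio and the process creation rate mentioned above can
be roughly estimated from two snapshots of /proc/stat. This is only a
minimal sketch, not a replacement for sar: the function names are made
up for illustration, it counts user + nice as "user" time, and it
counts only the "system" column as kernel time (irq/softirq time is
ignored, which is a simplification).

```python
import time


def parse_proc_stat(text):
    """Return (user_ticks, system_ticks, forks) from /proc/stat contents."""
    user = system = forks = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # Aggregate line: cpu user nice system idle iowait irq softirq ...
            user = int(fields[1]) + int(fields[2])  # user + nice (assumption)
            system = int(fields[3])                 # system column only
        elif fields[0] == "processes":
            # Number of forks since boot.
            forks = int(fields[1])
    return user, system, forks


def summarize(before, after, seconds):
    """Return (user/system ratio, forks per second) between two snapshots."""
    u0, s0, f0 = parse_proc_stat(before)
    u1, s1, f1 = parse_proc_stat(after)
    d_user, d_system = u1 - u0, s1 - s0
    ratio = d_user / d_system if d_system else float("inf")
    return ratio, (f1 - f0) / seconds


def sample(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and summarize them."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return summarize(before, after, interval)
```

Calling sample() once while the machine is healthy and again while it
is sluggish gives two ratios to compare; by the rule of thumb above, a
ratio that drops well below 5 points at heavy kernel-side work, and a
high forks-per-second figure points at something spawning processes.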
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx