Re: workstation has become ill

Well, I did pull out the NVIDIA GPU.  One reason I suspected it is
that I had seen many kernel oops messages from the nouveau driver.
This GPU is just for ML; the console is connected to the motherboard's
built-in VGA (and normally all use is remote anyway).  The last time
the system became unusable was right after I ran
systemctl restart display-manager, so that made me suspicious.
If I need to use the GPU, I'll try installing the NVIDIA driver first
and disabling nouveau.
The system seems to be working normally now, although I have not tried
rebooting.
Thanks for your suggestions!

On Fri, Feb 5, 2021 at 11:14 AM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
>
> If it were a failing or weak power supply, the machine would just
> crash; nothing slows down nicely when that happens.
>
> An Nvidia GPU will usually crash the hardware if it overstresses the
> power supply, and will also crash if it goes bad.
>
> Now, overheating may cause the CPUs to throttle, and that may make
> the machine feel rather sluggish, though I would not expect delays of
> minutes unless there is normally a large CPU load on the machine.
>
> Install the perf package, and next time see if you can run "perf
> top".  It shows which functions the kernel and the running processes
> are spending their time in.  Note that on machines with a large
> number of cores, the ondemand power-saving governor that adjusts the
> clock frequency is expensive to run.  That will show up as
> significant system time, and it will be visible in perf top.
>
> If you don't have it installed, install sar or a similar tool that
> will give you some idea of what the system saw leading up to the
> issues and during them.  I usually set sar to a 1-minute sample
> instead of the default 10-minute sample; that change is done via
> systemd, Google knows how.
>
> The other things that will crush a machine and aren't obvious are
> applications creating processes at a high rate and/or applications
> mapping and unmapping a lot of memory.  Those also show up as system
> time and leave a footprint in perf top.  So, while the machine is
> running well, note the ratio of user to system time: user being 5x
> system or higher is normal, and a drop to much below 5x often
> indicates one of the above issues.  sar will show CPU usage
> (user/system/...), disk response times, a lot of raw network and
> TCP/UDP stats, process creation rates, and memory allocation and
> paging rates.
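
For anyone who wants a quick look at the user-to-system ratio described above before sar is set up, here is a minimal sketch in Python.  It assumes the standard /proc/stat layout, where the first line is "cpu user nice system idle ...".  The function names and the 5-second interval are my own choices for illustration, and counting nice time as user time is an assumption too.  It is not a replacement for sar or perf top.

```python
import time


def parse_cpu_line(line):
    """Return (user, system) tick counts from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    user, nice, system = (int(x) for x in fields[1:4])
    # Assumption: niced user time is counted as user time.
    return user + nice, system


def user_system_ratio(before, after):
    """User/system time ratio between two samples of the 'cpu' line.

    Returns None if no CPU time was used at all, and infinity if user
    time was used but no system time.
    """
    u0, s0 = parse_cpu_line(before)
    u1, s1 = parse_cpu_line(after)
    du, ds = u1 - u0, s1 - s0
    if ds == 0:
        return float("inf") if du > 0 else None
    return du / ds


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate CPU counters)."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = read_cpu_line()
    print("user/system ratio:", user_system_ratio(first, second))
```

Run it once while the machine is healthy and again when it is sluggish; by the rule of thumb above, a ratio well under 5 would point toward heavy kernel-side work.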
>
>
>
> On Fri, Feb 5, 2021 at 6:09 AM Neal Becker <ndbecker2@xxxxxxxxx> wrote:
> >
> > I've been running F32 on a shiny new amd dual epyc workstation for
> > about 1 year.  The system is now remote to me and not convenient to
> > access.
> >
> > About 1 week ago the system became unresponsive.  I noticed I/O
> > errors in the logs, so I guessed it was an issue with the SSD.  I
> > went there and replaced the SSD with a shiny new Samsung 1TB.
> > Reinstalled F33 and got my vpns going so I could access again from
> > home.
> >
> > But things are acting very strangely.  Install was lightning fast.
> > But after a while the machine becomes unusable.  Any command takes
> > minutes to react.  I am unable to reboot it.  sudo reboot after a very
> > long time does nothing.
> > I don't see anything interesting in /var/log/messages (I installed rsyslog).
> > When I can eventually get top to run, I see systemd is in D state.
> > There is plenty of free memory, and the machine has 64GB.
> >
> > I'm going to visit again and this time yank out the nvidia gpu.  This
> > is just a wild guess based on 1) it isn't critical for use right now
> > 2) it places a load on the power supply just in case that's the issue
> > 3) it's the only thing I can think to try.
> >
> > Just wondering if anyone has any thoughts on how to troubleshoot this.