Re: Crash and automatical reboot when using the NVIDIA card




From: David McGiven <davidmcgivenn@xxxxxxxxx>

> I'm running a Supermicro server with the latest CentOS 6.4 packages (kernel:
> 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
> A few minutes after I start using the GPU for some HPC calculations, the
> server crashes and reboots itself. This happens every time. I know it
> will reboot, but I don't know when. Sometimes it's 20 minutes after I
> start using it. Sometimes it's 2 hours.
> If I unplug the GPU card and put some stress on the server, it works OK. So
> I suspect there's a bug in the kernel/nvidia driver.
> I can't find any messages in /var/log/messages.

Did you check the IPMI logs?
The first thing that comes to mind is overheating.
Maybe dump the temperatures to a log file every minute and, after the next
reboot, check whether there is a spike just before the crash.
Or maybe it is a freeze, with the watchdog kicking in and resetting the machine?
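
For the temperature dump, something along these lines might do. It is only a rough sketch in Python 3 (the stock python on CentOS 6 is older, so it may need adapting). It assumes that nvidia-smi from your driver accepts --query-gpu and that ipmitool is installed and can reach the local BMC; I have not checked either on your box. The function names and the reader list are made up for the example. Each line is fsync'ed so the last samples before a hard reset should still be in the file.

```python
import os
import subprocess
import time
from datetime import datetime

# Assumption: nvidia-smi is on PATH and this driver supports --query-gpu.
GPU_CMD = ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"]
# Assumption: ipmitool is installed and can talk to the local BMC.
IPMI_CMD = ["ipmitool", "sdr", "type", "Temperature"]


def run(cmd):
    """Run cmd and return its output squeezed onto one line (or an error note)."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=30)
    except (OSError, subprocess.SubprocessError) as exc:
        return "ERROR: %s" % exc
    lines = [ln.strip() for ln in out.decode(errors="replace").splitlines()]
    return "; ".join(ln for ln in lines if ln)


def append_line(path, line):
    """Append one line and force it to disk so it survives a sudden reset."""
    with open(path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())


def log_temperatures(path, readers, interval=60, max_samples=None):
    """Every `interval` seconds, log a timestamp plus one field per reader.

    readers is a list of (name, zero-argument callable) pairs.
    max_samples=None loops until the process (or the server) dies.
    """
    taken = 0
    while max_samples is None or taken < max_samples:
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        fields = ["%s=%s" % (name, read()) for name, read in readers]
        append_line(path, stamp + " " + " ".join(fields))
        taken += 1
        if max_samples is None or taken < max_samples:
            time.sleep(interval)


# Hypothetical default set of readers: GPU temperature plus IPMI sensors.
DEFAULT_READERS = [
    ("gpu", lambda: run(GPU_CMD)),
    ("ipmi", lambda: run(IPMI_CMD)),
]
```

You would start it before the HPC job with something like log_temperatures("/root/temps.log", DEFAULT_READERS) (the path is just an example) and look at the tail of the file after the next reboot.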

JD
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos



