Hi Daniel,

On 2019-05-12 9:44 p.m., Daniel Kasak wrote:
> Hi all. I had version 2.2.0 of the ROCM stack running on a 5.0.x and
> 5.1.0 kernel. Things were going great with various boinc GPU tasks.
> But there is a setiathome GPU task which reliably gives me a hard
> lockup within about 30 minutes of running. I actually had to do *two*
> emergency re-installs over the past week.

Sorry to hear about your trouble. Do you have a second computer you can
use to log in remotely to your system? Chances are that it's still
responsive and only the screen is frozen. Also, you could try booting in
console mode (without an X server). The console usually still works even
when the GPU compute units or SDMA engines are hanging.

If you manage to do an emergency reboot with sysrq (remount-RO and
reboot), you should see the kernel log of your previous session in
/var/log. On Ubuntu it's in /var/log/kern.log. Not sure where it is on
Gentoo. There is a good chance the log contains helpful information
(e.g. if the driver detected a hang but failed to reset the GPU, or
maybe a driver bug that leads to a deadlock or kernel panic).

> Perhaps part of this was my fault ( running btrfs with lzo compression
> on my root partition ... ). But absolutely part of this was the hard
> lockups. I've tested all kinds of other things ( eg rebuilding lots of
> stuff under Gentoo ) ... I don't have a general stability issue even
> under hours of high load. But after restarting boinc with that same
> setiathome task ... <bang>!
>
> If someone wants me to sacrifice another installation, they can point
> me to instructions for trying to gather more information.

If you want to risk another installation, it may be a good idea to do it
on a spare hard drive, or a spare partition on your existing hard drive.
Also, use a more conventional choice of file system. A simple ext4 is
pretty robust in my experience. We get hard lockups all the time.
I usually only reinstall my system for big OS upgrades or if I'm stupid
and mess something up myself.

Which GPU are you using? There are some things you could try to narrow
down the cause of your problem.

 1. Monitor GPU temperature while running setiathome
 2. If you're building your own kernel, enable kernel debug options
    that can provide helpful diagnostic info: lock debugging, memory
    debugging, lockup/hang debugging
 3. Try running with lower GPU clocks (rocm-smi --setperflevel low). If
    that fixes it, you may have inadequate cooling or power supply
 4. Try running in console mode (without an X server or other graphical
    UI running). If that fixes it, there may be a bad interaction
    between graphics and compute
 5. Try updating your firmware. The DKMS package included in our ROCm
    releases includes the latest firmware. You should be able to
    extract it from there and drop it into /lib/firmware/amdgpu
 6. Try to find a regression point. Is there any known version of ROCm
    or the kernel where it worked correctly?

Regards,
  Felix

>
> Anyway ... perhaps more work around detecting and recovering from GPU
> lockups is in order?
>
> Dan
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
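
As a rough illustration of the kernel-log check suggested earlier in this
reply, the sketch below filters a saved kernel log for lines that mention a
GPU driver together with a hang-related keyword. The keyword and driver-name
lists are assumptions chosen for illustration, not an official list of
amdgpu messages, and the function name suspicious_lines is hypothetical. The
path /var/log/kern.log is the Ubuntu location mentioned above and will
likely differ on Gentoo.

```python
import re
from pathlib import Path

# Assumed patterns for illustration only; adjust them to what your log shows.
DRIVERS = re.compile(r"amdgpu|amdkfd|kfd|drm", re.IGNORECASE)
KEYWORDS = re.compile(r"timeout|reset|hang|lockup|panic|BUG|Oops", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return log lines that mention both a GPU driver and a hang keyword."""
    return [
        line
        for line in log_text.splitlines()
        if DRIVERS.search(line) and KEYWORDS.search(line)
    ]


if __name__ == "__main__":
    # Ubuntu location from the mail; other distributions use other paths.
    log_path = Path("/var/log/kern.log")
    try:
        text = log_path.read_text(errors="replace")
    except OSError as exc:
        print(f"Could not read {log_path}: {exc}")
    else:
        for line in suspicious_lines(text):
            print(line)
```

Reading the log usually needs root or membership in a log-reading group. An
empty result does not prove the driver was healthy; it only means none of
the assumed keywords appeared.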