Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Christian König <deathsimple@xxxxxxxxxxx> · Mon, 07 Oct 2013 13:08:13 +0200

First of all, I can't complain about the reliability of the hardware
GPU reset. It's mostly the kernel driver that happens to run into a
deadlock at the same time.

Alex and I spend quite some time on making this reliable again after 
activating more rings and adding VM support. The main problem is that I 
couldn't figure out where the CPU deadlock comes from, cause I couldn't 
reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and 
sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in 
the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads 
are waiting for fences on one of them detects that we have a lockup 
while the other one keeps holding the exclusive lock. Signaling all 
fences might work around that problem, but it probably would be better 
to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:
First of all, I can't complain about the reliability of the hardware
GPU reset. It's mostly the kernel driver that happens to run into a
deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset
completes successfully according to dmesg, but X doesn't respond. I
can move the cursor on the screen, but I can't do anything else and
the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
easily reproduce this, because it's the most common reason why a GPU
lockup leads to frozen X. The GPU actually recovers, but X is hung. I
can't tell whether the fences are just not signalled or whether there
is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful
(except for extreme cases, see below). With a small enough lockup
timeout, the lockups are just a minor annoyance and I thought I could
get through a piglit run just with a few tens or hundreds of GPU
resets...

A different type of deadlock showed up, though it needs a lot of
concurrently-running apps like piglit. What happened is that the
kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due
to a GPU hang while holding onto the exclusive lock, and another
thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got
a few lockups while writing this email and I'm glad I didn't have to
reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König <deathsimple@xxxxxxxxxxx> wrote:
Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency.
Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák <maraeo@xxxxxxxxx> schrieb:

I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König <deathsimple@xxxxxxxxxxx> wrote:
NAK, after recovering from a lockup the first thing we do is signalling all remaining fences with an IB test.

If we don't recover we indeed signal all fences manually.

Signalling all fences regardless of the outcome of the reset creates problems with both types of partial resets.

Christian.

Marek Olšák <maraeo@xxxxxxxxx> schrieb:

From: Marek Olšák <marek.olsak@xxxxxxx>

After a lockup, fences are not signalled sometimes, causing
the GEM_WAIT_IDLE ioctl to never return, which sometimes results
in an X server freeze.

This fixes only one of many deadlocks which can occur during a lockup.

Signed-off-by: Marek Olšák <marek.olsak@xxxxxxx>
---
drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 841d0e0..7b97baa 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
       radeon_save_bios_scratch_regs(rdev);
       /* block TTM */
       resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
+
+      mutex_lock(&rdev->ring_lock);
+      radeon_fence_driver_force_completion(rdev);
+      mutex_unlock(&rdev->ring_lock);
+
       radeon_pm_suspend(rdev);
       radeon_suspend(rdev);

--
1.8.1.2

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel