Re: [PATCH] drm/radeon: Avoid double gpu reset by adding a timeout on IB ring tests.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 31.01.2016 um 19:50 schrieb Matthew Dawson:
On Sunday, January 24, 2016 10:49:00 AM EST Christian König wrote:
Am 24.01.2016 um 07:18 schrieb Matthew Dawson:
When the radeon driver resets a gpu, it attempts to test whether all the
rings can successfully handle an IB.  If these rings fail to respond, the
process will wait forever.  Another gpu reset can't happen at this point,
as the current reset holds a lock required to do so.  Instead, make all
the IB tests run with a timeout, so the system can attempt to recover
in this case.

While this doesn't fix the underlying issue with card resets failing, it
gives the system a higher chance of recovering.  These timeouts have been
confirmed to help both a Tathi and Hawaii card recover after a gpu reset.

I haven't been able to test this on other cards, but everything compiles.
I wasn't sure what to use for a timeout value, so for now I've used
radeon_lockup_timeout.  When that is 0, the timeout functionality in this
patch is also disabled.

This also adds a new function, radeon_fence_wait_timeout, that behaves
like
radeon_fence_wait except allowing a different timeout value to be passed
to
radeon_fence_wait_seq_timeout.

Signed-off-by: Matthew Dawson <matthew@xxxxxxxxxxxxx>
Really good idea, but please don't use radeon_lockup_timeout for this
cause the 10 seconds default of that is way to long. Instead just use a
fixed, let's say 100ms timeout defined in radeon.h.
I originally tried 100ms, but I found that to be too short (my system would
just lockup completely (no network even)).  I found 1s is long enough, and
isn't so long to be annoying.  Would that be ok?

Yeah, 1s works as well. But that 100ms shouldn't be enough sounds like another bug to me.


Also don't define our own radeon_fence_wait_timeout, just use
fence_wait_timeout(&fence->base, RADEON_IB_TEST_TIMEOUT) for this.
I switched everything over to this, however the ib test always times out after
a gpu reset (on startup, everything is normal).  I'm not sure why
fence_wait_timeout isn't being signaled while radeon_fence_wait_seq_timeout
is.  I'm suspicious that either radeon_fence_enable_signaling is not doing
something necessary to get the fence completion to actual signal
radeon_fence_default_wait, or because we hold the exclusive lock during a
reset radeon_fence_check_lockup never runs and thus never triggers the fence
or enables some irq lines.

Is it worth digging through the interactions here to get dma-buf fences to
work correctly during a lockup, or would it be better to just keep
radeon_fence_wait_timeout?  If option 1 is preferred, I'm happy to learn but I
need some help learning how it is supposed to work.

In this case feel free to add the radeon_fence_wait_timeout. It's probably a good idea to get rid of the exclusive lock sooner or later in radeon as well, but that really is a long term project.

If you're really bored I can give you tons of input where to start with that.

Regards,
Christian.


Apart from that the idea looks really good to me.

Regards,
Christian.

Thanks,

Matthew

---

   drivers/gpu/drm/radeon/cik_sdma.c     |  3 ++-
   drivers/gpu/drm/radeon/r100.c         |  3 ++-
   drivers/gpu/drm/radeon/r600.c         |  3 ++-
   drivers/gpu/drm/radeon/r600_dma.c     |  3 ++-
   drivers/gpu/drm/radeon/radeon.h       |  1 +
   drivers/gpu/drm/radeon/radeon_fence.c | 25 ++++++++++++++++++++++---
   drivers/gpu/drm/radeon/radeon_vce.c   |  3 ++-
   drivers/gpu/drm/radeon/uvd_v1_0.c     |  3 ++-
   8 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/radeon/cik_sdma.c
b/drivers/gpu/drm/radeon/cik_sdma.c index d16f2ee..014f5ca 100644
--- a/drivers/gpu/drm/radeon/cik_sdma.c
+++ b/drivers/gpu/drm/radeon/cik_sdma.c
@@ -737,7 +737,8 @@ int cik_sdma_ib_test(struct radeon_device *rdev,
struct radeon_ring *ring)>
   		DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
   		return r;
   	
   	}

-	r = radeon_fence_wait(ib.fence, false);
+	r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
+		(radeon_lockup_timeout) ? radeon_lockup_timeout :
MAX_SCHEDULE_TIMEOUT));>
   	if (r) {
   	
   		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
   		return r;

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 5eae0a8..5386c09 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -3732,7 +3732,8 @@ int r100_ib_test(struct radeon_device *rdev, struct
radeon_ring *ring)>
   		DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
   		goto free_ib;
   	
   	}

-	r = radeon_fence_wait(ib.fence, false);
+	r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
+		(radeon_lockup_timeout) ? radeon_lockup_timeout :
MAX_SCHEDULE_TIMEOUT));>
   	if (r) {
   	
   		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
   		goto free_ib;

diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index cc2fdf0..8fbe5d7 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -3381,7 +3381,8 @@ int r600_ib_test(struct radeon_device *rdev, struct
radeon_ring *ring)>
   		DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
   		goto free_ib;
   	
   	}

-	r = radeon_fence_wait(ib.fence, false);
+	r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
+		(radeon_lockup_timeout) ? radeon_lockup_timeout :
MAX_SCHEDULE_TIMEOUT));>
   	if (r) {
   	
   		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
   		goto free_ib;

diff --git a/drivers/gpu/drm/radeon/r600_dma.c
b/drivers/gpu/drm/radeon/r600_dma.c index d2dd29a..3ff614d 100644
--- a/drivers/gpu/drm/radeon/r600_dma.c
+++ b/drivers/gpu/drm/radeon/r600_dma.c
@@ -368,7 +368,8 @@ int r600_dma_ib_test(struct radeon_device *rdev,
struct radeon_ring *ring)>
   		DRM_ERROR("radeon: failed to schedule ib (%d).\n", r);
   		return r;
   	
   	}

-	r = radeon_fence_wait(ib.fence, false);
+	r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies(
+		(radeon_lockup_timeout) ? radeon_lockup_timeout :
MAX_SCHEDULE_TIMEOUT));>
   	if (r) {
   	
   		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
   		return r;

diff --git a/drivers/gpu/drm/radeon/radeon.h
b/drivers/gpu/drm/radeon/radeon.h index 5ae6db9..0b698b6 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -382,6 +382,7 @@ void radeon_fence_driver_force_completion(struct
radeon_device *rdev, int ring);>
   int radeon_fence_emit(struct radeon_device *rdev, struct radeon_fence
   **fence, int ring); void radeon_fence_process(struct radeon_device
   *rdev, int ring);
   bool radeon_fence_signaled(struct radeon_fence *fence);

+int radeon_fence_wait_timeout(struct radeon_fence *fence, bool
interruptible, long timeout);>
   int radeon_fence_wait(struct radeon_fence *fence, bool interruptible);
   int radeon_fence_wait_next(struct radeon_device *rdev, int ring);
   int radeon_fence_wait_empty(struct radeon_device *rdev, int ring);

diff --git a/drivers/gpu/drm/radeon/radeon_fence.c
b/drivers/gpu/drm/radeon/radeon_fence.c index 05815c4..9fec805 100644
--- a/drivers/gpu/drm/radeon/radeon_fence.c
+++ b/drivers/gpu/drm/radeon/radeon_fence.c
@@ -527,7 +527,7 @@ static long radeon_fence_wait_seq_timeout(struct
radeon_device *rdev,>
   }
/**

- * radeon_fence_wait - wait for a fence to signal
+ * radeon_fence_wait_timeout - wait for a fence to signal with timeout

    *
    * @fence: radeon fence object
    * @intr: use interruptible sleep

@@ -535,9 +535,10 @@ static long radeon_fence_wait_seq_timeout(struct
radeon_device *rdev,>
    * Wait for the requested fence to signal (all asics).
    * @intr selects whether to use interruptable (true) or
    non-interruptable
    * (false) sleep when waiting for the fence.

+ * @timeout: maximum time to wait, or MAX_SCHEDULE_TIMEOUT for infinite
wait>
    * Returns 0 if the fence has passed, error for all other cases.
    */

-int radeon_fence_wait(struct radeon_fence *fence, bool intr)
+int radeon_fence_wait_timeout(struct radeon_fence *fence, bool intr, long
timeout)>
   {
uint64_t seq[RADEON_NUM_RINGS] = {};
   	long r;

@@ -552,9 +553,11 @@ int radeon_fence_wait(struct radeon_fence *fence,
bool intr)>
   		return fence_wait(&fence->base, intr);
   	
   	seq[fence->ring] = fence->seq;

-	r = radeon_fence_wait_seq_timeout(fence->rdev, seq, intr,
MAX_SCHEDULE_TIMEOUT); +	r = radeon_fence_wait_seq_timeout(fence->rdev,
seq, intr, timeout);>
   	if (r < 0) {
   	
   		return r;

+	} else if (r == 0) {
+		return -ETIMEDOUT;

   	}
   	
   	r = fence_signal(&fence->base);

@@ -564,6 +567,22 @@ int radeon_fence_wait(struct radeon_fence *fence,
bool intr)>
   }
/**

+ * radeon_fence_wait - wait for a fence to signal
+ *
+ * @fence: radeon fence object
+ * @intr: use interruptible sleep
+ *
+ * Wait for the requested fence to signal (all asics).
+ * @intr selects whether to use interruptable (true) or non-interruptable
+ * (false) sleep when waiting for the fence.
+ * Returns 0 if the fence has passed, error for all other cases.
+ */
+int radeon_fence_wait(struct radeon_fence *fence, bool intr)
+{
+	return radeon_fence_wait_timeout(fence, intr, MAX_SCHEDULE_TIMEOUT);
+}
+
+/**

    * radeon_fence_wait_any - wait for a fence to signal on any ring
    *
    * @rdev: radeon device pointer

diff --git a/drivers/gpu/drm/radeon/radeon_vce.c
b/drivers/gpu/drm/radeon/radeon_vce.c index 7eb1ae7..7d80dad 100644
--- a/drivers/gpu/drm/radeon/radeon_vce.c
+++ b/drivers/gpu/drm/radeon/radeon_vce.c
@@ -810,7 +810,8 @@ int radeon_vce_ib_test(struct radeon_device *rdev,
struct radeon_ring *ring)>
   		goto error;
   	
   	}

-	r = radeon_fence_wait(fence, false);
+	r = radeon_fence_wait_timeout(fence, false, msecs_to_jiffies(
+		(radeon_lockup_timeout) ? radeon_lockup_timeout :
MAX_SCHEDULE_TIMEOUT));>
   	if (r) {
   	
   		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
   	
   	} else {

diff --git a/drivers/gpu/drm/radeon/uvd_v1_0.c
b/drivers/gpu/drm/radeon/uvd_v1_0.c index c6b1cbc..35caa89 100644
--- a/drivers/gpu/drm/radeon/uvd_v1_0.c
+++ b/drivers/gpu/drm/radeon/uvd_v1_0.c
@@ -522,7 +522,8 @@ int uvd_v1_0_ib_test(struct radeon_device *rdev,
struct radeon_ring *ring)>
   		goto error;
   	
   	}

-	r = radeon_fence_wait(fence, false);
+	r = radeon_fence_wait_timeout(fence, false, msecs_to_jiffies(
+		(radeon_lockup_timeout) ? radeon_lockup_timeout :
MAX_SCHEDULE_TIMEOUT));>
   	if (r) {
   	
   		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
   		goto error;


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux