[PATCH] drm/radeon: disable any GPU activity after unrecovered lockup v5

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Jerome Glisse <jglisse@xxxxxxxxxx>

After unrecovered GPU lockup avoid any GPU activities to avoid
things like kernel segfault and alike to happen in any of the
path that assume hw is working.

The segfault is due to PCIE vram gart table being unmapped after
suspend in the GPU reset path. To avoid segault to happen and to
avoid further GPU activity if unsuccessful at reseting GPU we
use the accel_working boolean to transform ttm activities into
noop. It does not impact the module load path because in that
path ttm have an empty schedule queue and accel_working will be
set to true as soon as the gart table is in valid state. Because
ttm might have work queued it is better to use the accel working
then disabling radeon_bo ioctl.

To trigger the segfault launch a program that repeatly create bo
in ttm and let it run in background, then trigger gpu lockup from
another process.

This patch also for video mode restoring on r1xx,r2xx,r3xx,r4xx,
r5xx,rs4xx,rs6xx GPU even if GPU reset fail. When GPU reset fails
it is very likely (so far i never had it not working) that the
modesetting part of the GPU is still alive. So we can have a
chance to get kernel backtrace or other debugging informations
on the screen if we always restore the video mode.

v2: fix spelling error and disable accel before suspend and reenable
    it after pcie gart initialization to be even more cautious about
    possible segfault. Improve commit message
v3: Improve commit message to describe the video mode restoring no
    matter what.
v4: Avoid issue after successfull GPU lockup recovery. Don't do noop
    ttm move, instead report error if move needs bind or unbind or
    fallback to memcpy. Don't restrict new bo domain instead refuse
    to create new bo if gpu reset failed. Disable accel working
    in gart vram table unpin thus we don't change the behavior of
    the suspend path.
v5: Avoid set domain to also trigger noop bind/unbind, instead force
    it to wait for GPU reset to go through or return failure if
    gpu reset fails.

cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Jerome Glisse <jglisse@xxxxxxxxxx>
---
 drivers/gpu/drm/radeon/evergreen.c     |    2 +-
 drivers/gpu/drm/radeon/ni.c            |    2 +-
 drivers/gpu/drm/radeon/r300.c          |    2 +-
 drivers/gpu/drm/radeon/r520.c          |    2 +-
 drivers/gpu/drm/radeon/r600.c          |    2 +-
 drivers/gpu/drm/radeon/radeon_device.c |    8 +++++---
 drivers/gpu/drm/radeon/radeon_gart.c   |    1 +
 drivers/gpu/drm/radeon/radeon_gem.c    |   33 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/radeon/radeon_ttm.c    |   23 ++++++++++++++++++++++
 drivers/gpu/drm/radeon/rs400.c         |    2 +-
 drivers/gpu/drm/radeon/rs600.c         |    2 +-
 drivers/gpu/drm/radeon/rs690.c         |    2 +-
 drivers/gpu/drm/radeon/rv515.c         |    2 +-
 drivers/gpu/drm/radeon/rv770.c         |    2 +-
 drivers/gpu/drm/radeon/si.c            |    2 +-
 15 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index c3073f7..5f154e3 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -3071,6 +3071,7 @@ static int evergreen_startup(struct radeon_device *rdev)
 		if (r)
 			return r;
 	}
+	rdev->accel_working = true;
 	evergreen_gpu_init(rdev);
 
 	r = evergreen_blit_init(rdev);
@@ -3145,7 +3146,6 @@ int evergreen_resume(struct radeon_device *rdev)
 	/* post card */
 	atom_asic_init(rdev->mode_info.atom_context);
 
-	rdev->accel_working = true;
 	r = evergreen_startup(rdev);
 	if (r) {
 		DRM_ERROR("evergreen startup failed on resume\n");
diff --git a/drivers/gpu/drm/radeon/ni.c b/drivers/gpu/drm/radeon/ni.c
index dc2e34d..486faa8 100644
--- a/drivers/gpu/drm/radeon/ni.c
+++ b/drivers/gpu/drm/radeon/ni.c
@@ -1245,6 +1245,7 @@ static int cayman_startup(struct radeon_device *rdev)
 	r = cayman_pcie_gart_enable(rdev);
 	if (r)
 		return r;
+	rdev->accel_working = true;
 	cayman_gpu_init(rdev);
 
 	r = evergreen_blit_init(rdev);
@@ -1337,7 +1338,6 @@ int cayman_resume(struct radeon_device *rdev)
 	/* post card */
 	atom_asic_init(rdev->mode_info.atom_context);
 
-	rdev->accel_working = true;
 	r = cayman_startup(rdev);
 	if (r) {
 		DRM_ERROR("cayman startup failed on resume\n");
diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
index 97722a3..206ac1f 100644
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
@@ -1358,6 +1358,7 @@ static int r300_startup(struct radeon_device *rdev)
 		if (r)
 			return r;
 	}
+	rdev->accel_working = true;
 
 	if (rdev->family == CHIP_R300 ||
 	    rdev->family == CHIP_R350 ||
@@ -1426,7 +1427,6 @@ int r300_resume(struct radeon_device *rdev)
 	/* Initialize surface registers */
 	radeon_surface_init(rdev);
 
-	rdev->accel_working = true;
 	r = r300_startup(rdev);
 	if (r) {
 		rdev->accel_working = false;
diff --git a/drivers/gpu/drm/radeon/r520.c b/drivers/gpu/drm/radeon/r520.c
index b5cf837..6409eb0 100644
--- a/drivers/gpu/drm/radeon/r520.c
+++ b/drivers/gpu/drm/radeon/r520.c
@@ -181,6 +181,7 @@ static int r520_startup(struct radeon_device *rdev)
 		if (r)
 			return r;
 	}
+	rdev->accel_working = true;
 
 	/* allocate wb buffer */
 	r = radeon_wb_init(rdev);
@@ -236,7 +237,6 @@ int r520_resume(struct radeon_device *rdev)
 	/* Initialize surface registers */
 	radeon_surface_init(rdev);
 
-	rdev->accel_working = true;
 	r = r520_startup(rdev);
 	if (r) {
 		rdev->accel_working = false;
diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index bb1958d..d022533 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -2394,6 +2394,7 @@ int r600_startup(struct radeon_device *rdev)
 		if (r)
 			return r;
 	}
+	rdev->accel_working = true;
 	r600_gpu_init(rdev);
 	r = r600_blit_init(rdev);
 	if (r) {
@@ -2477,7 +2478,6 @@ int r600_resume(struct radeon_device *rdev)
 	/* post card */
 	atom_asic_init(rdev->mode_info.atom_context);
 
-	rdev->accel_working = true;
 	r = r600_startup(rdev);
 	if (r) {
 		DRM_ERROR("r600 startup failed on resume\n");
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 066c98b..27e6e64 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -998,11 +998,13 @@ int radeon_gpu_reset(struct radeon_device *rdev)
 	if (!r) {
 		dev_info(rdev->dev, "GPU reset succeed\n");
 		radeon_resume(rdev);
-		radeon_restore_bios_scratch_regs(rdev);
-		drm_helper_resume_force_mode(rdev->ddev);
-		ttm_bo_unlock_delayed_workqueue(&rdev->mman.bdev, resched);
 	}
 
+	/* no matter what restore video mode */
+	radeon_restore_bios_scratch_regs(rdev);
+	drm_helper_resume_force_mode(rdev->ddev);
+	ttm_bo_unlock_delayed_workqueue(&rdev->mman.bdev, resched);
+
 	if (r) {
 		/* bad news, how to tell it to userspace ? */
 		dev_info(rdev->dev, "GPU reset failed\n");
diff --git a/drivers/gpu/drm/radeon/radeon_gart.c b/drivers/gpu/drm/radeon/radeon_gart.c
index 59d4493..c343297 100644
--- a/drivers/gpu/drm/radeon/radeon_gart.c
+++ b/drivers/gpu/drm/radeon/radeon_gart.c
@@ -117,6 +117,7 @@ void radeon_gart_table_vram_unpin(struct radeon_device *rdev)
 	if (rdev->gart.robj == NULL) {
 		return;
 	}
+	rdev->accel_working = false;
 	r = radeon_bo_reserve(rdev->gart.robj, false);
 	if (likely(r == 0)) {
 		radeon_bo_kunmap(rdev->gart.robj);
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index f28bd4b..74176c5 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -217,6 +217,22 @@ int radeon_gem_create_ioctl(struct drm_device *dev, void *data,
 	uint32_t handle;
 	int r;
 
+	/* if in middle of gpu reset wait on the mutex, if reset
+	 * failed don't do anything, otherwise proceed.
+	 */
+	if (!rdev->accel_working) {
+		/* lockup situation wait for cs mutex to be droped */
+		radeon_mutex_lock(&rdev->cs_mutex);
+		radeon_mutex_unlock(&rdev->cs_mutex);
+		/* we are after a gpu reset */
+		if (!rdev->accel_working) {
+			/* gpu reset failed, don't create anymore GPU
+			 * object
+			 */
+			return -EBUSY;
+		}
+	}
+
 	/* create a gem object to contain this object in */
 	args->size = roundup(args->size, PAGE_SIZE);
 	r = radeon_gem_object_create(rdev, args->size, args->alignment,
@@ -240,6 +256,7 @@ int radeon_gem_create_ioctl(struct drm_device *dev, void *data,
 int radeon_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 				struct drm_file *filp)
 {
+	struct radeon_device *rdev = dev->dev_private;
 	/* transition the BO to a domain -
 	 * just validate the BO into a certain domain */
 	struct drm_radeon_gem_set_domain *args = data;
@@ -247,6 +264,22 @@ int radeon_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 	struct radeon_bo *robj;
 	int r;
 
+	/* if in middle of gpu reset wait on the mutex, if reset
+	 * failed don't do anything, otherwise proceed.
+	 */
+	if (!rdev->accel_working) {
+		/* lockup situation wait for cs mutex to be droped */
+		radeon_mutex_lock(&rdev->cs_mutex);
+		radeon_mutex_unlock(&rdev->cs_mutex);
+		/* we are after a gpu reset */
+		if (!rdev->accel_working) {
+			/* gpu reset failed, don't create anymore GPU
+			 * object
+			 */
+			return -EBUSY;
+		}
+	}
+
 	/* for now if someone requests domain CPU -
 	 * just make sure the buffer is finished with */
 
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index c94a225..287c6a8 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -399,6 +399,15 @@ static int radeon_bo_move(struct ttm_buffer_object *bo,
 		radeon_move_null(bo, new_mem);
 		return 0;
 	}
+	if (!rdev->accel_working) {
+		if (new_mem->mem_type == TTM_PL_TT ||
+		    old_mem->mem_type == TTM_PL_TT) {
+			/* can't bind/unbind tt */
+			return -EBUSY;
+		}
+		/* try memcpy */
+		goto memcpy;
+	}
 	if ((old_mem->mem_type == TTM_PL_TT &&
 	     new_mem->mem_type == TTM_PL_SYSTEM) ||
 	    (old_mem->mem_type == TTM_PL_SYSTEM &&
@@ -545,6 +554,13 @@ static int radeon_ttm_backend_bind(struct ttm_tt *ttm,
 		WARN(1, "nothing to bind %lu pages for mreg %p back %p!\n",
 		     ttm->num_pages, bo_mem, ttm);
 	}
+	if (!gtt->rdev->accel_working) {
+		/* when accel is not working GPU is in broken state just
+		 * do nothing for any ttm operation to avoid making the
+		 * situation worse than it is
+		 */
+		return 0;
+	}
 	r = radeon_gart_bind(gtt->rdev, gtt->offset,
 			     ttm->num_pages, ttm->pages, gtt->ttm.dma_address);
 	if (r) {
@@ -559,6 +575,13 @@ static int radeon_ttm_backend_unbind(struct ttm_tt *ttm)
 {
 	struct radeon_ttm_tt *gtt = (void *)ttm;
 
+	if (!gtt->rdev->accel_working) {
+		/* when accel is not working GPU is in broken state just
+		 * do nothing for any ttm operation to avoid making the
+		 * situation worse than it is
+		 */
+		return 0;
+	}
 	radeon_gart_unbind(gtt->rdev, gtt->offset, ttm->num_pages);
 	return 0;
 }
diff --git a/drivers/gpu/drm/radeon/rs400.c b/drivers/gpu/drm/radeon/rs400.c
index a464eb5..39fac11 100644
--- a/drivers/gpu/drm/radeon/rs400.c
+++ b/drivers/gpu/drm/radeon/rs400.c
@@ -404,6 +404,7 @@ static int rs400_startup(struct radeon_device *rdev)
 	r = rs400_gart_enable(rdev);
 	if (r)
 		return r;
+	rdev->accel_working = true;
 
 	/* allocate wb buffer */
 	r = radeon_wb_init(rdev);
@@ -460,7 +461,6 @@ int rs400_resume(struct radeon_device *rdev)
 	/* Initialize surface registers */
 	radeon_surface_init(rdev);
 
-	rdev->accel_working = true;
 	r = rs400_startup(rdev);
 	if (r) {
 		rdev->accel_working = false;
diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c
index e95c5e6..5633203 100644
--- a/drivers/gpu/drm/radeon/rs600.c
+++ b/drivers/gpu/drm/radeon/rs600.c
@@ -886,6 +886,7 @@ static int rs600_startup(struct radeon_device *rdev)
 	r = rs600_gart_enable(rdev);
 	if (r)
 		return r;
+	rdev->accel_working = true;
 
 	/* allocate wb buffer */
 	r = radeon_wb_init(rdev);
@@ -946,7 +947,6 @@ int rs600_resume(struct radeon_device *rdev)
 	/* Initialize surface registers */
 	radeon_surface_init(rdev);
 
-	rdev->accel_working = true;
 	r = rs600_startup(rdev);
 	if (r) {
 		rdev->accel_working = false;
diff --git a/drivers/gpu/drm/radeon/rs690.c b/drivers/gpu/drm/radeon/rs690.c
index 159b6a4..633220e 100644
--- a/drivers/gpu/drm/radeon/rs690.c
+++ b/drivers/gpu/drm/radeon/rs690.c
@@ -615,6 +615,7 @@ static int rs690_startup(struct radeon_device *rdev)
 	r = rs400_gart_enable(rdev);
 	if (r)
 		return r;
+	rdev->accel_working = true;
 
 	/* allocate wb buffer */
 	r = radeon_wb_init(rdev);
@@ -675,7 +676,6 @@ int rs690_resume(struct radeon_device *rdev)
 	/* Initialize surface registers */
 	radeon_surface_init(rdev);
 
-	rdev->accel_working = true;
 	r = rs690_startup(rdev);
 	if (r) {
 		rdev->accel_working = false;
diff --git a/drivers/gpu/drm/radeon/rv515.c b/drivers/gpu/drm/radeon/rv515.c
index 7f08ced..81af328 100644
--- a/drivers/gpu/drm/radeon/rv515.c
+++ b/drivers/gpu/drm/radeon/rv515.c
@@ -386,6 +386,7 @@ static int rv515_startup(struct radeon_device *rdev)
 		if (r)
 			return r;
 	}
+	rdev->accel_working = true;
 
 	/* allocate wb buffer */
 	r = radeon_wb_init(rdev);
@@ -441,7 +442,6 @@ int rv515_resume(struct radeon_device *rdev)
 	/* Initialize surface registers */
 	radeon_surface_init(rdev);
 
-	rdev->accel_working = true;
 	r =  rv515_startup(rdev);
 	if (r) {
 		rdev->accel_working = false;
diff --git a/drivers/gpu/drm/radeon/rv770.c b/drivers/gpu/drm/radeon/rv770.c
index b4f51c5..95a594c 100644
--- a/drivers/gpu/drm/radeon/rv770.c
+++ b/drivers/gpu/drm/radeon/rv770.c
@@ -910,6 +910,7 @@ static int rv770_startup(struct radeon_device *rdev)
 		if (r)
 			return r;
 	}
+	rdev->accel_working = true;
 
 	rv770_gpu_init(rdev);
 	r = r600_blit_init(rdev);
@@ -979,7 +980,6 @@ int rv770_resume(struct radeon_device *rdev)
 	/* post card */
 	atom_asic_init(rdev->mode_info.atom_context);
 
-	rdev->accel_working = true;
 	r = rv770_startup(rdev);
 	if (r) {
 		DRM_ERROR("r600 startup failed on resume\n");
diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c
index c7b61f1..813cc34 100644
--- a/drivers/gpu/drm/radeon/si.c
+++ b/drivers/gpu/drm/radeon/si.c
@@ -3675,6 +3675,7 @@ static int si_startup(struct radeon_device *rdev)
 	r = si_pcie_gart_enable(rdev);
 	if (r)
 		return r;
+	rdev->accel_working = true;
 	si_gpu_init(rdev);
 
 #if 0
@@ -3795,7 +3796,6 @@ int si_resume(struct radeon_device *rdev)
 	/* post card */
 	atom_asic_init(rdev->mode_info.atom_context);
 
-	rdev->accel_working = true;
 	r = si_startup(rdev);
 	if (r) {
 		DRM_ERROR("si startup failed on resume\n");
-- 
1.7.10.2

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux