If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx> --- This may be -stable material. drivers/gpu/drm/radeon/radeon.h | 1 + drivers/gpu/drm/radeon/radeon_ring.c | 23 ++++++++++++++++------- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 8263af3..9de5778 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -652,6 +652,7 @@ struct radeon_ring { unsigned ring_free_dw; int count_dw; unsigned long last_activity; + unsigned long last_test_lockup; unsigned last_rptr; uint64_t gpu_addr; uint32_t align_mask; diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. * * Caller should write to the ring to force CP to do something so we don't get * false positive when CP is just gived nothing to do. @@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) **/ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring) { - unsigned long cjiffies, elapsed; + unsigned long cjiffies, elapsed, last_test; uint32_t rptr; cjiffies = jiffies; + + last_test = ring->last_test_lockup; + ring->last_test_lockup = cjiffies; + if (!time_after(cjiffies, ring->last_activity)) { /* likely a wrap around */ radeon_ring_lockup_update(ring); @@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin radeon_ring_lockup_update(ring); return false; } + if (cjiffies - last_test > 2 * HZ) { + /* Possible race -- see comment above */ + radeon_ring_lockup_update(ring); + return false; + } elapsed = jiffies_to_msecs(cjiffies - ring->last_activity); if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) { dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed); -- 1.8.1.4 _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/dri-devel