Re: [PATCH 07/32] drm/i915: Store the reset counter when constructing a request

On 16/12/15 10:19, Chris Wilson wrote:
On Wed, Dec 16, 2015 at 10:44:19AM +0100, Daniel Vetter wrote:
On Fri, Dec 11, 2015 at 11:33:03AM +0000, Chris Wilson wrote:
As the request is only valid during the same global reset epoch, we can
record the current reset_counter when constructing the request and reuse
it when waiting upon that request in future. This removes a very hairy
atomic check serialised by the struct_mutex at the time of waiting and
allows us to transfer those waits to a central dispatcher for all
waiters and all requests.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
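
In other words, the reset epoch is sampled exactly once, when the request
is constructed, and every later wait compares that stored value against the
live counter. A minimal, self-contained sketch of that pattern (a toy
user-space model with made-up names such as gpu_error and request_construct,
not the driver code itself):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct gpu_error {
	atomic_uint reset_counter;	/* bumped on every GPU reset */
};

struct request {
	unsigned int reset_counter;	/* epoch sampled at construction */
};

static void request_construct(struct request *req, struct gpu_error *error)
{
	/* Record the current reset epoch once, at request creation time. */
	req->reset_counter = atomic_load(&error->reset_counter);
}

/* True if a reset happened between constructing the request and waiting
 * on it, in which case the wait must be aborted and retried. */
static bool request_epoch_stale(const struct request *req,
				struct gpu_error *error)
{
	return req->reset_counter != atomic_load(&error->reset_counter);
}

int main(void)
{
	struct gpu_error error = { .reset_counter = 0 };
	struct request req;

	request_construct(&req, &error);
	atomic_fetch_add(&error.reset_counter, 1);	/* simulate a reset */
	printf("wait must abort: %s\n",
	       request_epoch_stale(&req, &error) ? "yes" : "no");
	return 0;
}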

"global reset epoch" got me thinking about what the implications for TDR
are. Now the current TDR patches iirc are still somewhat tacked on the
side of the existing reset handling, and so don't end up incrementing the
reset counter.

But I don't agree with that design decision, and I think any reset must
increment the counter (and do the reset-in-progress flag dance), since
that's the only way to guarantee that waiters will get off locks, even
though we have fewer of those with every kernel release.

But then there's the problem that at least some requests will still be
valid after the reset, and those would break with this design here. That
can easily be fixed by incrementing the reset epoch for every request we
decide should keep running (or which gets re-scheduled/emitted for another
attempt), and doing that explicitly seems like a good idea.
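
Explicitly, something like this (a hypothetical helper, not part of this
patch or of the posted TDR series):

static void i915_gem_request_advance_epoch(struct drm_i915_gem_request *req)
{
	/* Move a surviving/resubmitted request into the new reset epoch so
	 * that __i915_wait_request() does not treat it as lost. */
	req->reset_counter = i915_reset_counter(&req->i915->gpu_error);
}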

The only implication I see is that for per-ring reset we might want to go
to a per-engine reset epoch counter.

Yes, I think per-engine epochs will be required. (I've been trying to
choose my words carefully so that it is clear that this only applies to
today and not to a TDR future!)
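
Purely as a sketch of that per-engine direction (nothing like this exists
in the driver today):

	/* struct intel_engine_cs would grow its own epoch, e.g.
	 *	unsigned reset_counter;	(bumped on a per-engine reset)
	 * and the request would sample the counter of the engine it was
	 * built for, so resetting one ring does not invalidate waits on
	 * another: */
	req->reset_counter = READ_ONCE(req->ring->reset_counter);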

So given all that I think this is a solid idea. But one comment below as a
fallout.

---
  drivers/gpu/drm/i915/i915_drv.h         |  2 +-
  drivers/gpu/drm/i915/i915_gem.c         | 40 +++++++++++----------------------
  drivers/gpu/drm/i915/intel_display.c    |  7 +-----
  drivers/gpu/drm/i915/intel_lrc.c        |  7 ------
  drivers/gpu/drm/i915/intel_ringbuffer.c |  6 -----
  5 files changed, 15 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1043ddd670a5..f30c305a6889 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2178,6 +2178,7 @@ struct drm_i915_gem_request {
  	/** On Which ring this request was generated */
  	struct drm_i915_private *i915;
  	struct intel_engine_cs *ring;
+	unsigned reset_counter;

  	 /** GEM sequence number associated with the previous request,
  	  * when the HWS breadcrumb is equal to this the GPU is processing
@@ -3059,7 +3060,6 @@ void __i915_add_request(struct drm_i915_gem_request *req,
  #define i915_add_request_no_flush(req) \
  	__i915_add_request(req, NULL, false)
  int __i915_wait_request(struct drm_i915_gem_request *req,
-			unsigned reset_counter,
  			bool interruptible,
  			s64 *timeout,
  			struct intel_rps_client *rps);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 27e617b76418..b17cc0e42a4f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1214,7 +1214,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
  /**
   * __i915_wait_request - wait until execution of request has finished
   * @req: duh!
- * @reset_counter: reset sequence associated with the given request
   * @interruptible: do an interruptible wait (normally yes)
   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
   *
@@ -1229,7 +1228,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
   * errno with remaining time filled in timeout argument.
   */
  int __i915_wait_request(struct drm_i915_gem_request *req,
-			unsigned reset_counter,
  			bool interruptible,
  			s64 *timeout,
  			struct intel_rps_client *rps)
@@ -1288,7 +1286,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,

  		/* We need to check whether any gpu reset happened in between
  		 * the caller grabbing the seqno and now ... */
-		if (reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
+		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {

READ_ONCE(req->reset) as future-proofing for TDR, maybe? Or punt that to
TDR? It could be confusing to READ_ONCE something that is (with the
current code at least) invariant.
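
Roughly, and assuming the field meant is the req->reset_counter added by
this patch, the check would then read:

		if (READ_ONCE(req->reset_counter) !=
		    i915_reset_counter(&dev_priv->gpu_error)) {
			/* A reset happened, or TDR moved the request into
			 * a new epoch, since we started waiting: bail out
			 * as before. */
		}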

Right, my plan was to solve this in TDR! At the moment, I am thinking
that we generate a new request for each one that is resubmitted (which
requires patching the seqno embedded in the ring before restarting it),
but otherwise we can just duplicate the contents of the original request.
-Chris

With the scheduler, there's no need to patch the contents of the ringbuffer; we just reset it (so it's empty) and then have the scheduler re-emit the requests, starting after the one that hung.
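
Very roughly, and with entirely made-up helper names (neither exists in
the driver today), that recovery path would be:

static void tdr_recover_engine(struct intel_engine_cs *ring,
			       struct drm_i915_gem_request *hung)
{
	/* Hypothetical: throw away the ringbuffer contents... */
	intel_ring_reset_to_empty(ring);

	/* ...and let the scheduler re-emit every request queued after the
	 * one that hung, instead of patching seqnos in place. */
	i915_scheduler_resubmit_after(ring, hung);
}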

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx



