Re: [PATCH 9/9] drm/i915/gt: Schedule request retirement when timeline idles

On 20/11/2019 13:39, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-11-20 13:16:51)

On 20/11/2019 09:33, Chris Wilson wrote:
The major drawback of commit 7e34f4e4aad3 ("drm/i915/gen8+: Add RC6 CTX
corruption WA") is that it disables RC6 while Skylake (and friends) is
active, and we do not consider the GPU idle until all outstanding
requests have been retired and the engine switched over to the kernel
context. If userspace is idle, this task falls onto our background idle
worker, which only runs roughly once a second, meaning that userspace has
to have been idle for a couple of seconds before we enable RC6 again.
Naturally, this causes us to consume considerably more energy than
before as powersaving is effectively disabled while a display server
(here's looking at you Xorg) is running.

As execlists will get a completion event as each context is completed,
we can use this interrupt to queue a retire worker bound to this engine
to cleanup idle timelines. We will then immediately notice the idle
engine (without userspace intervention or the aid of the background
retire worker) and start parking the GPU. Thus during light workloads,
we will do much more work to idle the GPU faster...  Hopefully with
commensurate power saving!

v2: Watch context completions and only look at those local to the engine
when retiring to reduce the amount of excess work we perform.

References: https://bugs.freedesktop.org/show_bug.cgi?id=112315
References: 7e34f4e4aad3 ("drm/i915/gen8+: Add RC6 CTX corruption WA")
References: 2248a28384fe ("drm/i915/gen8+: Add RC6 CTX corruption WA")
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
---
   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  8 +-
   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  8 ++
   drivers/gpu/drm/i915/gt/intel_gt_requests.c   | 74 +++++++++++++++++++
   drivers/gpu/drm/i915/gt/intel_gt_requests.h   | 17 ++++-
   drivers/gpu/drm/i915/gt/intel_lrc.c           |  9 +++
   drivers/gpu/drm/i915/gt/intel_timeline.c      |  1 +
   .../gpu/drm/i915/gt/intel_timeline_types.h    |  3 +
   7 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index b9613d044393..8f6e353caa66 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -28,13 +28,13 @@
#include "i915_drv.h"
 
-#include "gt/intel_gt.h"
-
+#include "intel_context.h"
   #include "intel_engine.h"
   #include "intel_engine_pm.h"
   #include "intel_engine_pool.h"
   #include "intel_engine_user.h"
-#include "intel_context.h"
+#include "intel_gt.h"
+#include "intel_gt_requests.h"
   #include "intel_lrc.h"
   #include "intel_reset.h"
   #include "intel_ring.h"
@@ -617,6 +617,7 @@ static int intel_engine_setup_common(struct intel_engine_cs *engine)
       intel_engine_init_execlists(engine);
       intel_engine_init_cmd_parser(engine);
       intel_engine_init__pm(engine);
+     intel_engine_init_retire(engine);
       intel_engine_pool_init(&engine->pool);
 
@@ -839,6 +840,7 @@ void intel_engine_cleanup_common(struct intel_engine_cs *engine)
       cleanup_status_page(engine);
 
+      intel_engine_fini_retire(engine);
       intel_engine_pool_fini(&engine->pool);
       intel_engine_fini_breadcrumbs(engine);
       intel_engine_cleanup_cmd_parser(engine);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 758f0e8ec672..17f1f1441efc 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -451,6 +451,14 @@ struct intel_engine_cs {
       struct intel_engine_execlists execlists;
 
+      /*
+      * Keep track of completed timelines on this engine for early
+      * retirement with the goal of quickly enabling powersaving as
+      * soon as the engine is idle.
+      */
+     struct intel_timeline *retire;
+     struct work_struct retire_work;
+
       /* status_notifier: list of callbacks for context-switch changes */
       struct atomic_notifier_head context_status_notifier;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 4dc3cbeb1b36..4a98fefdf915 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -29,6 +29,80 @@ static void flush_submission(struct intel_gt *gt)
               intel_engine_flush_submission(engine);
   }
+static void engine_retire(struct work_struct *work)
+{
+     struct intel_engine_cs *engine =
+             container_of(work, typeof(*engine), retire_work);
+     struct intel_timeline *tl = xchg(&engine->retire, NULL);

Shouldn't this be atomic_xchg to avoid racing with add_retire?

+
+     do {
+             struct intel_timeline *next = xchg(&tl->retire, NULL);

Here as well?

xchg() is always locked.

atomic_xchg() operates on atomic_t; xchg() works on any variable, like
cmpxchg().

+
+             /*
+              * Our goal here is to retire _idle_ timelines as soon as
+              * possible (as they are idle, we do not expect userspace
+              * to be cleaning up anytime soon).
+              *
+              * If the tl->active_count is already zero, someone else
+              * should have retired the timeline. Equally if the timeline
+              * is currently locked, either it is being retired elsewhere
+              * or about to be!
+              */
+             if (atomic_read(&tl->active_count) &&
+                 mutex_trylock(&tl->mutex)) {
+                     retire_requests(tl);
+                     mutex_unlock(&tl->mutex);
+             }
+             intel_timeline_put(tl);
+
+             GEM_BUG_ON(!next);
+             tl = ptr_mask_bits(next, 1);

You sometimes expect engine->retire to contain 0x1?

Yes. Imagine that we are submitting very fast, such that we schedule_out
the same context before the worker has run; we would then try to
add_retire() the same timeline again. So I was using BIT(0) to tag an
active element in the retirement list.

If schedule_out happens on the same context before the retire worker has run, then add_retire() is a no-op due to tl->retire being set.

It could be a race between engine_retire() and add_retire() where engine->retire is set to NULL, so the latter would set tl->retire to NULL | BIT(0).

But BIT(0) is only filtered out and not acted upon anywhere, so I am still lost.

So maybe from the opposite angle, what goes wrong if you drop all BIT(0) business?


+     } while (tl);
+}
+
+static bool add_retire(struct intel_engine_cs *engine,
+                    struct intel_timeline *tl)
+{
+     struct intel_timeline *first = READ_ONCE(engine->retire);
+
+     /*
+      * We open-code a llist here to include the additional tag [BIT(0)]
+      * so that we know when the timeline is already on a
+      * retirement queue: either this engine or another.
+      *
+      * However, we rely on a timeline only being active on a single
+      * engine at any one time, and on add_retire() being called before
+      * the engine releases the timeline and it is transferred to
+      * another engine to retire.
+      */
+
+     if (READ_ONCE(tl->retire)) /* already queued */
+             return false;

Can't this go first in the function?

Conceptually it is. And I made it so because I also decided against
having the READ_ONCE() at the top.

But READ_ONCE is at the top, just a different one which is discarded if this one is NULL.


+
+     intel_timeline_get(tl);
+     do
+             tl->retire = ptr_pack_bits(first, 1, 1);

Here you rely on the assignment being atomic, right?

Ish. Here we rely on the timeline being owned by the engine so it cannot
be submitted by another (and so schedule_out called) until this engine
has released it.

It is a weak point for generality, but the ordering is strong in
execlists.

+     while (!try_cmpxchg(&engine->retire, &first, tl));

So the loop is effectively creating a chain of timelines to retire on
this engine.

What happens with virtual engines when a timeline goes to a different
engine (or any single-timeline context, for that matter) before the
retire worker runs? Ah okay, it gets re-assigned to the most recent engine.

Right. The engine_retire() doesn't care which engine the timelines were
run on, it's just a list of suspected idle timelines.

I am not sure about the BIT(0) business. It's always set on write, so I
am not getting why it is useful.

It's also set to 0 on consumption :)

Another thing: asserting that engine->retire is NULL after the flush_work() in intel_engine_fini_retire() could be useful.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



