Re: [PATCH v3] drm/i915: Replace gen6 semaphore signal table with code

On 21/07/16 14:46, Tvrtko Ursulin wrote:

On 21/07/16 14:31, Chris Wilson wrote:
On Thu, Jul 21, 2016 at 02:16:22PM +0100, Tvrtko Ursulin wrote:

On 21/07/16 13:59, Chris Wilson wrote:
On Thu, Jul 21, 2016 at 01:00:47PM +0100, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

Static table wastes space for invalid combinations and
engines which are not supported by Gen6 (legacy semaphores).

Replace it with a function devised by Dave Gordon.

I have verified that it generates the same mappings between
mbox selectors and signalling registers.

So just how big was that table? How big are the functions replacing it?

With I915_NUM_ENGINES of 5 the table is 5 * 5 * (2 * 4) = 200 bytes.

With the patch .text grows by 144 bytes here and .rodata shrinks by
256. So a net saving of 112 bytes with my config. Conclusion is that
as long as we have five engines it is not that interesting to get rid
of the table.

Since the semaphore matrix is only relevant to a specific gen, you could remove it from the multi-generational engine-list and instead just have it in the gen-specific code that needs it. That way it won't continue to grow as new engines are added. The one gen that needs it is fixed at 4x4, so it could just be a 16-byte lookup table, or 32 bits
(0b11001001_10110001_00101101_10010011) if you really want to save space ;-)
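A rough sketch of that 32-bit variant, assuming the 2-bit entries are packed row-major with the most significant pair first (the order in which the binary literal is written) and that rows and columns follow the select table in the patch comment quoted further down. GEN6_SEM_PACKED and gen6_sem_select() are hypothetical names, not anything in the driver, and which axis is the signaller is left open here.

```c
#include <assert.h>
#include <stdint.h>

/* 0b11001001_10110001_00101101_10010011 written as hex. */
#define GEN6_SEM_PACKED 0xC9B12D93u

/*
 * Hypothetical helper: return the 2-bit select (0-3) for table cell
 * (row, col), both in the range 0..3. The diagonal holds 3 (INVALID).
 * Cell (0, 0) sits in the top two bits, cell (3, 3) in the bottom two.
 */
static inline uint32_t gen6_sem_select(unsigned int row, unsigned int col)
{
	unsigned int cell = row * 4 + col;

	return (GEN6_SEM_PACKED >> (30 - 2 * cell)) & 3;
}

/* Compare every cell with the integer table, -1 replaced by 3. */
static void gen6_sem_select_selftest(void)
{
	static const uint8_t expect[4][4] = {
		{ 3, 0, 2, 1 },
		{ 2, 3, 0, 1 },
		{ 0, 2, 3, 1 },
		{ 2, 1, 0, 3 },
	};
	unsigned int row, col;

	for (row = 0; row < 4; row++)
		for (col = 0; col < 4; col++)
			assert(gen6_sem_select(row, col) == expect[row][col]);
}
```

The returned value would still need to be shifted left by 16 to land in the field the MI_SEMAPHORE_SYNC_* values occupy.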

v2: Add a comment describing what gen6_sem_f does.
v3: This time with git add.

I like having the table a lot... Even if we don't find the function
convincing we should add that comment.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Dave Gordon <david.s.gordon@xxxxxxxxx>
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
  drivers/gpu/drm/i915/i915_reg.h         |  7 +--
  drivers/gpu/drm/i915/intel_engine_cs.c  | 93 +++++++++++++++++++++++++++++++++
  drivers/gpu/drm/i915/intel_ringbuffer.c | 40 +-------------
  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++
  4 files changed, 102 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 9397ddec26b9..c2fe718582c8 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1604,9 +1604,10 @@ enum skl_disp_power_wells {
  #define RING_HEAD(base)        _MMIO((base)+0x34)
  #define RING_START(base)    _MMIO((base)+0x38)
  #define RING_CTL(base)        _MMIO((base)+0x3c)
-#define RING_SYNC_0(base)    _MMIO((base)+0x40)
-#define RING_SYNC_1(base)    _MMIO((base)+0x44)
-#define RING_SYNC_2(base)    _MMIO((base)+0x48)
+#define RING_SYNC(base, n)    _MMIO((base) + 0x40 + (n) * 4)
+#define RING_SYNC_0(base)    RING_SYNC(base, 0)
+#define RING_SYNC_1(base)    RING_SYNC(base, 1)
+#define RING_SYNC_2(base)    RING_SYNC(base, 2)
  #define GEN6_RVSYNC    (RING_SYNC_0(RENDER_RING_BASE))
  #define GEN6_RBSYNC    (RING_SYNC_1(RENDER_RING_BASE))
  #define GEN6_RVESYNC    (RING_SYNC_2(RENDER_RING_BASE))
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index f4a35ec78481..19455b20b322 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -209,3 +209,96 @@ int intel_engine_init_common(struct intel_engine_cs *engine)

      return i915_cmd_parser_init_ring(engine);
  }
+
+#define I915_NUM_GEN6_SEMAPHORE_ENGINES (4)
+
+/*
+ * For Gen6 semaphores where the driver issues MI_SEMAPHORE_MBOX commands
+ * with register selects so that a specific engine can wake up another engine
+ * waiting on a matching register, the matrix of required register selects
+ * looks like this:
+ *
+ *      |            RCS            |           VCS             |           BCS             |         VECS
+ * -----+---------------------------+---------------------------+---------------------------+---------------------------
+ *  RCS | MI_SEMAPHORE_SYNC_INVALID |    MI_SEMAPHORE_SYNC_VR   |    MI_SEMAPHORE_SYNC_BR   |    MI_SEMAPHORE_SYNC_VER
+ *  VCS |    MI_SEMAPHORE_SYNC_RV   | MI_SEMAPHORE_SYNC_INVALID |    MI_SEMAPHORE_SYNC_BV   |    MI_SEMAPHORE_SYNC_VEV
+ *  BCS |    MI_SEMAPHORE_SYNC_RB   |    MI_SEMAPHORE_SYNC_VB   | MI_SEMAPHORE_SYNC_INVALID |    MI_SEMAPHORE_SYNC_VEB
+ * VECS |    MI_SEMAPHORE_SYNC_RVE  |    MI_SEMAPHORE_SYNC_VVE  |    MI_SEMAPHORE_SYNC_BVE  | MI_SEMAPHORE_SYNC_INVALID
+ *
+ * This distilled to integers looks like this:
+ *
+ *   |  0  |  1  |  2  |  3
+ * --+-----+-----+-----+-----
+ * 0 | -1  |  0  |  2  |  1
+ * 1 |  2  | -1  |  0  |  1
+ * 2 |  0  |  2  | -1  |  1
+ * 3 |  2  |  1  |  0  | -1

Actually (and conveniently) MI_SEMAPHORE_SYNC_INVALID is 3 (<<16) so we don't really need to return -1 and then map it to INVALID, we can just use 0-3 directly. The binary string I wrote above represents this table; then to get the result we want it just has to be shifted.

+ *
+ * In the opposite direction, the same table showing register addresses is:
+ *
+ *      |     RCS      |     VCS      |     BCS      |    VECS
+ * -----+--------------+--------------+--------------+--------------
+ *  RCS | GEN6_NOSYNC  | GEN6_RVSYNC  | GEN6_RBSYNC  | GEN6_RVESYNC
+ *  VCS | GEN6_VRSYNC  | GEN6_NOSYNC  | GEN6_VBSYNC  | GEN6_VVESYNC
+ *  BCS | GEN6_BRSYNC  | GEN6_BVSYNC  | GEN6_NOSYNC  | GEN6_BVESYNC
+ * VECS | GEN6_VERSYNC | GEN6_VEVSYNC | GEN6_VEBSYNC | GEN6_NOSYNC
+ *
+ * Again this distilled to integers looks like this:
+ *
+ *   |  0  |  1  |  2  |  3
+ * --+-----+-----+-----+-----
+ * 0 | -1  |  0  |  1  |  2
+ * 1 |  1  | -1  |  0  |  2
+ * 2 |  0  |  1  | -1  |  2
+ * 3 |  1  |  2  |  0  | -1

With that table as the first function f1 (returning 0-3), the second function could just be a lookup in a 4-entry array indexed by the result. Or convert 3 to NOSYNC, then the rest is (3 - f1(x,y)) % 3.

I think those might give the best combination of code+data size :)
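To make the two options concrete, here is a sketch that assumes f1 is the select table from the comment with the -1 diagonal encoded as 3. The names f1_table, reg_from_f1, reg_index() and reg_index_selftest() are hypothetical, and -1 stands in for NOSYNC.

```c
#include <assert.h>

/* Select table from the comment, with the -1 diagonal written as 3. */
static const int f1_table[4][4] = {
	{ 3, 0, 2, 1 },
	{ 2, 3, 0, 1 },
	{ 0, 2, 3, 1 },
	{ 2, 1, 0, 3 },
};

/* Option 1: 4-entry array indexed by f1's result; -1 marks NOSYNC. */
static const int reg_from_f1[4] = { 0, 2, 1, -1 };

/* Option 2: special-case 3 as NOSYNC, otherwise (3 - f1) % 3. */
static int reg_index(int f1)
{
	return f1 == 3 ? -1 : (3 - f1) % 3;
}

/* Both options should reproduce the register-index table cell by cell. */
static void reg_index_selftest(void)
{
	static const int expect[4][4] = {
		{ -1,  0,  1,  2 },
		{  1, -1,  0,  2 },
		{  0,  1, -1,  2 },
		{  1,  2,  0, -1 },
	};
	int row, col;

	for (row = 0; row < 4; row++) {
		for (col = 0; col < 4; col++) {
			int f1 = f1_table[row][col];

			assert(reg_from_f1[f1] == expect[row][col]);
			assert(reg_index(f1) == expect[row][col]);
		}
	}
}
```

Either way only the first table needs storing; the array costs four entries of data, the arithmetic form costs a compare and a modulo.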

/*
  *               X
  *      |  0  |  1  |  2  |  3
  *    --+-----+-----+-----+-----
  *    0 |     |  0  |  1  |  2
  * Y  1 |  1  |     |  0  |  2
  *    2 |  0  |  1  |     |  2
  *    3 |  1  |  2  |  0  |
  */

You want another copy of the table here?

Yes. In particular, I need to know which axis is X and which is Y.
Having the table here is much easier to compare to the output of the
code (same screen).

Ok.

Let's call them 'from' and 'to' (or 'signaller' and 'waiter', though that's rather long) rather than x & y.

.Dave.

+    x -= x >= y;
+    if (y == 1)
+        x = 3 - x;
+    x += y & 1;
+    return x % 3;
+}
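The arithmetic in that body can be checked against the X/Y table in isolation. The sketch below wraps the quoted lines in a hypothetical sem_f_sketch(x, y); the real signature is trimmed from the quote, so the plain int types are an assumption. It treats x as the column and y as the row, and never passes the diagonal (x == y), which is not a valid input.

```c
#include <assert.h>

/* Hypothetical wrapper around the quoted body; types are assumed. */
static int sem_f_sketch(int x, int y)
{
	x -= x >= y;		/* columns past the diagonal drop by one: 0..2 */
	if (y == 1)
		x = 3 - x;	/* row 1 runs in the opposite direction */
	x += y & 1;		/* odd rows are offset by one */
	return x % 3;		/* wrap back into 0..2 */
}

/* Compare against the X/Y table; -1 marks the unused diagonal. */
static void sem_f_sketch_selftest(void)
{
	static const int expect[4][4] = {	/* indexed [y][x] */
		{ -1,  0,  1,  2 },
		{  1, -1,  0,  2 },
		{  0,  1, -1,  2 },
		{  1,  2,  0, -1 },
	};
	int x, y;

	for (y = 0; y < 4; y++)
		for (x = 0; x < 4; x++)
			if (x != y)
				assert(sem_f_sketch(x, y) == expect[y][x]);
}
```

With this orientation every off-diagonal cell matches, which suggests X is the horizontal axis of the table as drawn.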
+
+u32 gen6_wait_mbox(enum intel_engine_id x, enum intel_engine_id y)

static...

It is called from intel_ringbuffer.c.

Hmm. This was in intel_ringbuffer.c, at least I assumed so as this only
applies to legacy submission, for gen6-7.

It uses the static intel_engines array because dev_priv->engines is not
yet initialized for an engine by the time it runs.

Could as an alternative make the engine init phase multi-pass. Maybe.
Not sure what repercussions for the cleanup path that would have.

Regards,
Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



