Re: [RFC 4/4] drm/i915/gt: Pipelined page migration

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Mon, 30 Nov 2020 16:26:44 +0000

On 30/11/2020 16:21, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2020-11-30 16:07:44)

On 30/11/2020 13:39, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2020-11-30 13:12:55)
On 28/11/2020 18:40, Chris Wilson wrote:
If we pipeline the PTE updates and then do the copy of those pages
within a single unpreemptible command packet, we can submit the copies
and leave them to be scheduled without having to synchronously wait
under a global lock.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
    drivers/gpu/drm/i915/Makefile                 |   1 +
    drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
    drivers/gpu/drm/i915/gt/intel_migrate.c       | 370 ++++++++++++++++++
    drivers/gpu/drm/i915/gt/intel_migrate.h       |  33 ++
    drivers/gpu/drm/i915/gt/selftest_migrate.c    | 105 +++++
    .../drm/i915/selftests/i915_live_selftests.h  |   1 +
    6 files changed, 511 insertions(+)
    create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
    create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
    create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index e5574e506a5c..0b2e12c87f9d 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -103,6 +103,7 @@ gt-y += \
        gt/intel_gtt.o \
        gt/intel_llc.o \
        gt/intel_lrc.o \
+     gt/intel_migrate.o \
        gt/intel_mocs.o \
        gt/intel_ppgtt.o \
        gt/intel_rc6.o \
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index ac58fcda4927..079d26b47a97 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
    #define I915_GEM_HWS_PREEMPT_ADDR   (I915_GEM_HWS_PREEMPT * sizeof(u32))
    #define I915_GEM_HWS_SEQNO          0x40
    #define I915_GEM_HWS_SEQNO_ADDR             (I915_GEM_HWS_SEQNO * sizeof(u32))
+#define I915_GEM_HWS_MIGRATE         (0x42 * sizeof(u32))
    #define I915_GEM_HWS_SCRATCH                0x80
    
    #define I915_HWS_CSB_BUF0_INDEX             0x10
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
new file mode 100644
index 000000000000..4d7bd32eb8d4
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include "i915_drv.h"
+#include "intel_context.h"
+#include "intel_gt.h"
+#include "intel_gtt.h"
+#include "intel_lrc.h" /* virtual engine */
+#include "intel_migrate.h"
+#include "intel_ring.h"
+
+#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
+
+static void insert_pte(struct i915_address_space *vm,
+                    struct i915_page_table *pt,
+                    void *data)
+{
+     u64 *offset = data;
+
+     vm->insert_page(vm, px_dma(pt), *offset, I915_CACHE_NONE, 0);
+     *offset += PAGE_SIZE;
+}
+
+static struct i915_address_space *migrate_vm(struct intel_gt *gt)
+{
+     struct i915_vm_pt_stash stash = {};
+     struct i915_ppgtt *vm;
+     u64 offset, sz;
+     int err;
+
+     vm = i915_ppgtt_create(gt);
+     if (IS_ERR(vm))
+             return ERR_CAST(vm);
+
+     if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
+             err = -ENODEV;
+             goto err_vm;
+     }
+
+     /*
+      * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
+      * 4x2 page directories for source/destination.
+      */
+     sz = 2 * CHUNK_SZ;
+     offset = sz;
+
+     /*
+      * We need another page directory setup so that we can write
+      * the 8x512 PTE in each chunk.
+      */
+     sz += (sz >> 12) * sizeof(u64);
+
+     err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
+     if (err)
+             goto err_vm;
+
+     err = i915_vm_pin_pt_stash(&vm->vm, &stash);
+     if (err) {
+             i915_vm_free_pt_stash(&vm->vm, &stash);
+             goto err_vm;
+     }
+
+     vm->vm.allocate_va_range(&vm->vm, &stash, 0, sz);
+     i915_vm_free_pt_stash(&vm->vm, &stash);
+
+     /* Now allow the GPU to rewrite the PTE via its own ppGTT */
+     vm->vm.foreach(&vm->vm, 0, sz, insert_pte, &offset);

This is just making the [0 - sz) gva point to the allocated sz bytes of
backing store?

Not quite, we are pointing [offset, sz) to the page directories
themselves. When we write into that range, we are modifying the PTE of
this ppGTT. (Breaking the fourth wall, allowing the non-privileged
context to update its own page tables.)

Confusing, so first step is here making [0, sz) gva ptes point to pte
backing store.

Which means emit_pte when called from intel_context_migrate_pages is
emitting sdw which will write into [0, sz] gva, effectively placing the
src and dest object content at those location.

Then blit copies the data.

What happens if context is re-used? Gva's no longer point to PTEs so
how? Is a step to re-initialize after use missing? More sdw after the
blit copy to make it point back to ptes?

Every chunk starts with writing the PTEs used by the chunk. Within the
chunk, the GPU commands are not preemptible, so the PTEs cannot be
replaced before they are consumed.

We do leave the PTE pointing to stale pages after the blit, but since
the vm is only used by first rewriting the PTE, those dangling PTE
should never be chased.

So each chunk does a fixed copy between the same pair of addresses
within the vm; the magic is in rewriting the PTE to point at the pages
of interest for the chunk.

Isn't the vm getting potentially re-used on the fall-back path, if 
virtual engine creation fails? In which case dangling PTEs would be 
written to. Or I am still missing something?

+             i915_request_add(rq);
+             if (it_s.sg)
+                     cond_resched();

   From what context does this run? No preemptible?

Has to be process context; numerous allocations, implicit waits (that we
want to avoid in practice), and the timeline (per-context) mutex to
guard access to the ringbuffer.

I guess on non-preemptible, or not fully preemptible kernel it is useful.

Oh, the cond_resched() is almost certainly overkill. But potentially a
very large loop so worth being careful.

Yes, could be a large loop so doesn't harm.

Is the ringbuffer large enough to avoid stalls in that path?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx