Quoting Tvrtko Ursulin (2020-11-30 16:26:44)
>
> On 30/11/2020 16:21, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-30 16:07:44)
> >>
> >> On 30/11/2020 13:39, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2020-11-30 13:12:55)
> >>>> On 28/11/2020 18:40, Chris Wilson wrote:
> >>>>> If we pipeline the PTE updates and then do the copy of those pages
> >>>>> within a single unpreemptible command packet, we can submit the copies
> >>>>> and leave them to be scheduled without having to synchronously wait
> >>>>> under a global lock.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>>> ---
> >>>>>  drivers/gpu/drm/i915/Makefile | 1 +
> >>>>>  drivers/gpu/drm/i915/gt/intel_engine.h | 1 +
> >>>>>  drivers/gpu/drm/i915/gt/intel_migrate.c | 370 ++++++++++++++++++
> >>>>>  drivers/gpu/drm/i915/gt/intel_migrate.h | 33 ++
> >>>>>  drivers/gpu/drm/i915/gt/selftest_migrate.c | 105 +++++
> >>>>>  .../drm/i915/selftests/i915_live_selftests.h | 1 +
> >>>>>  6 files changed, 511 insertions(+)
> >>>>>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
> >>>>>  create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> >>>>> index e5574e506a5c..0b2e12c87f9d 100644
> >>>>> --- a/drivers/gpu/drm/i915/Makefile
> >>>>> +++ b/drivers/gpu/drm/i915/Makefile
> >>>>> @@ -103,6 +103,7 @@ gt-y += \
> >>>>>  	gt/intel_gtt.o \
> >>>>>  	gt/intel_llc.o \
> >>>>>  	gt/intel_lrc.o \
> >>>>> +	gt/intel_migrate.o \
> >>>>>  	gt/intel_mocs.o \
> >>>>>  	gt/intel_ppgtt.o \
> >>>>>  	gt/intel_rc6.o \
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>> index ac58fcda4927..079d26b47a97 100644
> >>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>> @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
> >>>>>  #define I915_GEM_HWS_PREEMPT_ADDR (I915_GEM_HWS_PREEMPT * sizeof(u32))
> >>>>>  #define I915_GEM_HWS_SEQNO 0x40
> >>>>>  #define I915_GEM_HWS_SEQNO_ADDR (I915_GEM_HWS_SEQNO * sizeof(u32))
> >>>>> +#define I915_GEM_HWS_MIGRATE (0x42 * sizeof(u32))
> >>>>>  #define I915_GEM_HWS_SCRATCH 0x80
> >>>>>
> >>>>>  #define I915_HWS_CSB_BUF0_INDEX 0x10
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>> new file mode 100644
> >>>>> index 000000000000..4d7bd32eb8d4
> >>>>> --- /dev/null
> >>>>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>> @@ -0,0 +1,370 @@
> >>>>> +// SPDX-License-Identifier: MIT
> >>>>> +/*
> >>>>> + * Copyright © 2020 Intel Corporation
> >>>>> + */
> >>>>> +
> >>>>> +#include "i915_drv.h"
> >>>>> +#include "intel_context.h"
> >>>>> +#include "intel_gt.h"
> >>>>> +#include "intel_gtt.h"
> >>>>> +#include "intel_lrc.h" /* virtual engine */
> >>>>> +#include "intel_migrate.h"
> >>>>> +#include "intel_ring.h"
> >>>>> +
> >>>>> +#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
> >>>>> +
> >>>>> +static void insert_pte(struct i915_address_space *vm,
> >>>>> +                       struct i915_page_table *pt,
> >>>>> +                       void *data)
> >>>>> +{
> >>>>> +        u64 *offset = data;
> >>>>> +
> >>>>> +        vm->insert_page(vm, px_dma(pt), *offset, I915_CACHE_NONE, 0);
> >>>>> +        *offset += PAGE_SIZE;
> >>>>> +}
> >>>>> +
> >>>>> +static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> >>>>> +{
> >>>>> +        struct i915_vm_pt_stash stash = {};
> >>>>> +        struct i915_ppgtt *vm;
> >>>>> +        u64 offset, sz;
> >>>>> +        int err;
> >>>>> +
> >>>>> +        vm = i915_ppgtt_create(gt);
> >>>>> +        if (IS_ERR(vm))
> >>>>> +                return ERR_CAST(vm);
> >>>>> +
> >>>>> +        if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
> >>>>> +                err = -ENODEV;
> >>>>> +                goto err_vm;
> >>>>> +        }
> >>>>> +
> >>>>> +        /*
> >>>>> +         * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
> >>>>> +         * 4x2 page directories for source/destination.
> >>>>> +         */
> >>>>> +        sz = 2 * CHUNK_SZ;
> >>>>> +        offset = sz;
> >>>>> +
> >>>>> +        /*
> >>>>> +         * We need another page directory setup so that we can write
> >>>>> +         * the 8x512 PTE in each chunk.
> >>>>> +         */
> >>>>> +        sz += (sz >> 12) * sizeof(u64);
> >>>>> +
> >>>>> +        err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
> >>>>> +        if (err)
> >>>>> +                goto err_vm;
> >>>>> +
> >>>>> +        err = i915_vm_pin_pt_stash(&vm->vm, &stash);
> >>>>> +        if (err) {
> >>>>> +                i915_vm_free_pt_stash(&vm->vm, &stash);
> >>>>> +                goto err_vm;
> >>>>> +        }
> >>>>> +
> >>>>> +        vm->vm.allocate_va_range(&vm->vm, &stash, 0, sz);
> >>>>> +        i915_vm_free_pt_stash(&vm->vm, &stash);
> >>>>> +
> >>>>> +        /* Now allow the GPU to rewrite the PTE via its own ppGTT */
> >>>>> +        vm->vm.foreach(&vm->vm, 0, sz, insert_pte, &offset);
> >>>>
> >>>> This is just making the [0 - sz) gva point to the allocated sz bytes of
> >>>> backing store?
> >>>
> >>> Not quite, we are pointing [offset, sz) to the page directories
> >>> themselves. When we write into that range, we are modifying the PTE of
> >>> this ppGTT. (Breaking the fourth wall, allowing the non-privileged
> >>> context to update its own page tables.)
> >>
> >> Confusing, so the first step here is making the [0, sz) gva ptes point
> >> to pte backing store.
> >>
> >> Which means emit_pte, when called from intel_context_migrate_pages, is
> >> emitting sdw which will write into [0, sz] gva, effectively placing the
> >> src and dest object content at those locations.
> >>
> >> Then the blit copies the data.
> >>
> >> What happens if the context is re-used? Gvas no longer point to PTEs,
> >> so how? Is a step to re-initialize after use missing? More sdw after
> >> the blit copy to make it point back to ptes?
> >
> > Every chunk starts with writing the PTEs used by the chunk. Within the
> > chunk, the GPU commands are not preemptible, so the PTEs cannot be
> > replaced before they are consumed.
> >
> > We do leave the PTEs pointing to stale pages after the blit, but since
> > the vm is only ever used by first rewriting the PTEs, those dangling
> > PTEs should never be chased.
> >
> > So each chunk does a fixed copy between the same pair of addresses
> > within the vm; the magic is in rewriting the PTEs to point at the pages
> > of interest for the chunk.
>
> Isn't the vm getting potentially re-used on the fall-back path, if
> virtual engine creation fails? In which case dangling PTEs would be
> written to. Or am I still missing something?

The same vm is used by all contexts (only one is created). We require
that the PTE write + copy occur uninterrupted, which is fine as we
control the arbitration points.
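As an aside for readers of the thread, the address-space layout behind
that scheme can be sanity-checked outside the kernel. The sketch below
is a standalone user-space illustration of the arithmetic in
migrate_vm(): two fixed 8MiB windows (source and destination) plus a
small region that maps the page tables themselves, so the GPU can
rewrite the window PTEs from inside the request immediately before each
copy. Only CHUNK_SZ and the sizing maths come from the patch; the
names, constants such as PAGE_SZ/PTE_SZ, and the prints are invented
here for illustration, not part of the driver.

/*
 * Standalone illustration only -- not the patch's code.
 */
#include <stdio.h>
#include <stdint.h>

#define SZ_8M    (8ull << 20)
#define CHUNK_SZ SZ_8M          /* per-chunk copy window, as in the patch */
#define PAGE_SZ  4096ull        /* PTE granularity assumed here */
#define PTE_SZ   8ull           /* sizeof(u64) */

int main(void)
{
        /*
         * [0, CHUNK_SZ)            -> window for the source pages
         * [CHUNK_SZ, 2 * CHUNK_SZ) -> window for the destination pages
         */
        uint64_t sz = 2 * CHUNK_SZ;

        /* PTEs needed to map both windows for one chunk: 2 * 2048 = 4096 ("8x512") */
        uint64_t ptes = sz / PAGE_SZ;

        /*
         * [2 * CHUNK_SZ, sz) maps the page tables themselves, which is what
         * lets the PTE rewrite and the blit sit in the same non-preemptible
         * packet.
         */
        uint64_t pte_window = ptes * PTE_SZ;    /* (sz >> 12) * sizeof(u64) */
        sz += pte_window;

        printf("copy windows:   %llu MiB\n", (unsigned long long)((2 * CHUNK_SZ) >> 20));
        printf("PTEs per chunk: %llu\n", (unsigned long long)ptes);
        printf("PTE window:     %llu KiB\n", (unsigned long long)(pte_window >> 10));
        printf("total va used:  %llu bytes\n", (unsigned long long)sz);
        return 0;
}

The point of the sketch is only that 2x8MiB of copy window plus a 32KiB
PTE window is all the address space the scheme ever needs, no matter how
large the objects being migrated are; the non-preemption guarantee Chris
describes above is what makes reusing that fixed window safe.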
> >>>>> +        i915_request_add(rq);
> >>>>> +        if (it_s.sg)
> >>>>> +                cond_resched();
> >>>>
> >>>> From what context does this run? Not preemptible?
> >>>
> >>> Has to be process context; numerous allocations, implicit waits (that we
> >>> want to avoid in practice), and the timeline (per-context) mutex to
> >>> guard access to the ringbuffer.
> >>
> >> I guess on a non-preemptible, or not fully preemptible, kernel it is
> >> useful.
> >
> > Oh, the cond_resched() is almost certainly overkill. But it is
> > potentially a very large loop, so worth being careful.
>
> Yes, it could be a large loop, so it doesn't harm.
>
> Is the ringbuffer large enough to avoid stalls in that path?

Is any ringbuffer large enough to be able to copy between two 128PiB
objects? It works out that we can only handle ~500M copies before the
buffer is full and we must clear a chunk. Switching to a new context at
that point would keep on extending it until we run out of lmem and/or
smem, but the cost of fabricating a new context at that point makes it
dubious.

I think that the greater impact would be from different clients using
the global context; fighting over the ringbuffer is still an improvement
over everyone doing a synchronous copy and blocking all, but the single
timeline still acts as a global serialisation and priority barrier.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx