From: Oscar Mateo <oscar.mateo@xxxxxxxxx>

Hi all,

This patch series implements execlists for GEN8+.

Before continuing, it is important to mention that, although I have taken it upon myself to assemble the series and rewrite it for upstreaming, many people have worked on this series before me. Namely:

Ben Widawsky (benjamin.widawsky@xxxxxxxxx)
Jesse Barnes (jbarnes@xxxxxxxxxxxxxxxx)
Michel Thierry (michel.thierry@xxxxxxxxx)
Thomas Daniel (thomas.daniel@xxxxxxxxx)
Rafael Barbalho (rafael.barbalho@xxxxxxxxx)

All good ideas in the series belong to these authors, and so I have tried to maintain authorship in the patches accordingly (to the extent possible, since the patches have suffered a lot of squashing & splitting). These authors do not, however, bear any of the blame for errors: I am solely responsible for them.

Now, let's get back to the subject at hand:

With GEN8 comes an expansion of the HW contexts: "Logical Ring Contexts". One of the main differences from the legacy HW contexts is that logical ring contexts incorporate many more things into the context's state, like PDPs or ringbuffer control registers.

These logical ring contexts enable a number of new abilities, especially "Execlists". Execlists are the new method by which, on GEN8+ hardware, workloads are submitted for execution (as opposed to the legacy, ringbuffer-based method). With this new method, commands in the context's ringbuffer are executed when the GPU moves to this context from a previous one (a.k.a. context switch).

On a context switch, the GPU has to remember the current state of the context being switched out, including the head and tail pointers of the ring buffer, so it:

- Flushes the pipe.
- Saves ringbuffer head pointer.
- Saves engine state.

Similarly, on a context restore (when a previously switched-out context is resubmitted), the GPU restores the saved context and resumes execution where it stopped:

- Restores PDPs and sets up PPGTT.
- Restores ringbuffer.
- Restores engine state.

Contexts are submitted for execution through the GPU's ExecLists Submit Port (ELSP, for short). This port supports the submission of two contexts at a time, which are executed serially (Context-0 first, Context-1 next) upon every context completion. The GPU keeps the software informed about the status of this list via context switch interrupts and context status buffers, to help software keep track of the progress. The existence of a second context ensures that HW has some useful work to do while the Context-0 switch status is being processed by SW. After Context-1 completion, HW goes IDLE if there are no further contexts scheduled in the ELSP. Submitting a new Execution List to the ELSP while one of its contexts is already running results in a Lite Restore (sampling of the new tail pointer).

Regarding the creation of logical ring contexts, we had before (since PPGTT was introduced):

- One global default context.
- One private default context for each opened fd.
- One extra private context for each context create ioctl call.

The global default context existed for future shrinker usage as well as reset handling. At the same time, every file got its own context, plus any number of extra contexts if the context create ioctl call was used by the userspace driver. These private contexts were the ones used by the driver for execbuffer calls.

Now that ringbuffers belong to contexts (and not to engines, like before) and that contexts are uniquely tied to a given engine (and not reusable, like before) we need:

- No. of engines global default contexts.
- Up to no. of engines private default contexts for each opened fd.
- Up to no. of engines extra private contexts for each context create ioctl call.

Given that at creation time of a non-global context we don't know which engine is going to use it, we have implemented a deferred creation of logical ring contexts: the private default context starts its life as a hollow or blank holder that gets populated once we receive an execbuffer ioctl (for a particular engine) on that fd. If later on we receive another execbuffer ioctl for a different engine, we create a second private default context, and so on. The same rules apply to the create context ioctl call.

Execlists have been implemented as follows:

When a request is committed, its commands (the BB start and any leading or trailing commands, like the seqno breadcrumbs) are placed in the ringbuffer for the appropriate context. The tail pointer in the hardware context is not updated at this time, but is instead kept by the driver in the ringbuffer structure. A structure representing this execution request is added to a request queue for the appropriate engine: this structure contains a copy of the context's tail after the request was written to the ringbuffer and a pointer to the context itself.

If the engine's request queue was empty before the request was added, the queue is processed immediately. Otherwise the queue will be processed during a context switch interrupt. In any case, elements on the queue will get sent (in pairs) to the ELSP with a globally unique 20-bit submission ID (constructed with the fd's ID, plus our own context ID, plus the engine's ID).

When execution of a request completes, the GPU updates the context status buffer with a context complete event and generates a context switch interrupt. During context switch interrupt handling, the driver examines the context status events in the context status buffer: for each context complete event, if the announced ID matches that on the head of the request queue, then that request is retired and removed from the queue.
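
To illustrate how a 20-bit submission ID could be built from the fd's ID, the context ID and the engine's ID, here is a minimal sketch. The field widths (11 + 6 + 3 bits), the field order and all names (SUBMIT_*_BITS, submission_id_pack, submission_id_engine) are hypothetical, chosen only so that the total is 20 bits; the text above does not specify the actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths: only the 20-bit total comes from the text. */
#define SUBMIT_ENGINE_BITS 3u
#define SUBMIT_CTX_BITS    6u
#define SUBMIT_FILE_BITS   11u

/* Pack the three IDs into one value: file | context | engine (assumed order). */
static uint32_t submission_id_pack(uint32_t file_id, uint32_t ctx_id,
                                   uint32_t engine_id)
{
    /* Each field must fit its width, or the packed IDs would collide. */
    assert(file_id < (1u << SUBMIT_FILE_BITS));
    assert(ctx_id < (1u << SUBMIT_CTX_BITS));
    assert(engine_id < (1u << SUBMIT_ENGINE_BITS));

    return (file_id << (SUBMIT_CTX_BITS + SUBMIT_ENGINE_BITS)) |
           (ctx_id << SUBMIT_ENGINE_BITS) |
           engine_id;
}

/* Recover the engine field from a packed submission ID. */
static uint32_t submission_id_engine(uint32_t id)
{
    return id & ((1u << SUBMIT_ENGINE_BITS) - 1u);
}
```

With this layout, two requests carry the same submission ID only if they come from the same fd, the same context and the same engine, which is the property the ID comparison in the interrupt handler relies on.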

After processing, if any requests were retired and the queue is not empty then a new execution list can be submitted. The two requests at the front of the queue are next to be submitted, but since a context may not occur twice in an execution list, if subsequent requests have the same ID as the first then the two requests must be combined. This is done simply by discarding requests at the head of the queue until either only one request is left (in which case we use a NULL second context) or the first two requests have unique IDs.

By always executing the first two requests in the queue the driver ensures that the GPU is kept as busy as possible. In the case where a single context completes but a second context is still executing, the request for the second context will be at the head of the queue when we remove the first one. This request will then be resubmitted along with a new request for a different context, which will cause the hardware to continue executing the second request and queue the new request (the GPU detects the condition of a context getting preempted with the same context and optimizes the context switch flow by not doing preemption, but just sampling the new tail pointer).

Because the GPU continues to execute while the context switch interrupt is being handled, there is a race condition where a second context completes while the completion of the previous one is still being handled. This results in the second context being resubmitted (potentially along with a third), and an extra context complete event for that context will occur. The request will be removed from the queue at the first context complete event, and the second context complete event will not result in removal of a request from the queue because the IDs of the request and the event will not match.
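
The queue handling described above can be sketched roughly as follows. This is a simplified, user-space illustration and not the code in the series: the fixed-size array queue and every name in it (exec_request, request_queue, queue_push, queue_select_pair, queue_handle_complete) are assumptions made for the example. It also assumes that a later request for the same context carries a newer tail, so discarding the earlier one loses no work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_MAX 16

/* Hypothetical request record: submission ID plus a copy of the tail. */
struct exec_request {
    uint32_t submission_id; /* identifies the context */
    uint32_t tail;          /* ringbuffer tail after the request was written */
};

struct request_queue {
    struct exec_request req[QUEUE_MAX];
    size_t count;
};

/* Append a request; returns 0 on success, -1 if the queue is full. */
static int queue_push(struct request_queue *q, uint32_t id, uint32_t tail)
{
    if (q->count == QUEUE_MAX)
        return -1;
    q->req[q->count].submission_id = id;
    q->req[q->count].tail = tail;
    q->count++;
    return 0;
}

/* Remove the request at the head, shifting the rest forward. */
static void queue_pop_head(struct request_queue *q)
{
    assert(q->count > 0);
    for (size_t i = 1; i < q->count; i++)
        q->req[i - 1] = q->req[i];
    q->count--;
}

/*
 * Pick the next execution list. Head requests are discarded while the
 * following request has the same ID, so a context never appears twice.
 * Returns the number of ports filled; unused ports are set to NULL.
 */
static int queue_select_pair(struct request_queue *q,
                             const struct exec_request *port[2])
{
    port[0] = port[1] = NULL;
    while (q->count > 1 &&
           q->req[0].submission_id == q->req[1].submission_id)
        queue_pop_head(q);
    if (q->count == 0)
        return 0;
    port[0] = &q->req[0];
    if (q->count == 1)
        return 1;
    port[1] = &q->req[1];
    return 2;
}

/*
 * Handle one context complete event: retire the head request only if
 * its ID matches the announced one. Returns 1 if a request was retired.
 */
static int queue_handle_complete(struct request_queue *q, uint32_t id)
{
    if (q->count > 0 && q->req[0].submission_id == id) {
        queue_pop_head(q);
        return 1;
    }
    return 0;
}
```

In this sketch a context complete event whose ID does not match the head of the queue is simply ignored, which is how the extra event from the race described above ends up being harmless.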

Cheers,
Oscar

Ben Widawsky (15):
  drm/i915/bdw: Macro to distinguish LRCs (Logical Ring Contexts)
  drm/i915: s/for_each_ring/for_each_active_ring
  drm/i915: for_each_ring
  drm/i915: Extract trivial parts of ring init (early init)
  drm/i915/bdw: Rework init code for gen8 contexts
  drm/i915: Extract ringbuffer obj alloc & destroy
  drm/i915/bdw: LR context ring init
  drm/i915/bdw: GEN8 semaphoreless ring add request
  drm/i915/bdw: GEN8 new ring flush
  drm/i915/bdw: A bit more advanced context init/fini
  drm/i915/bdw: Allocate ringbuffer for LR contexts
  drm/i915/bdw: Populate LR contexts (somewhat)
  drm/i915/bdw: Status page for LR contexts
  drm/i915/bdw: Enable execlists in the hardware
  drm/i915/bdw: Implement context switching (somewhat)

Michel Thierry (1):
  drm/i915/bdw: Get prepared for a two-stage execlist submit process

Oscar Mateo (30):
  drm/i915: Simplify a couple of functions thanks to for_each_ring
  drm/i915/bdw: New file for logical ring contexts and execlists
  drm/i915: Make i915_gem_create_context outside accessible
  drm/i915: s/intel_ring_buffer/intel_engine
  drm/i915: Split the ringbuffers and the rings
  drm/i915: Rename functions that mention ringbuffers (meaning rings)
  drm/i915/bdw: Execlists ring tail writing
  drm/i915/bdw: Plumbing for user LR context switching
  drm/i915: s/__intel_ring_advance/intel_ringbuffer_advance_and_submit
  drm/i915/bdw: Write a new set of context-aware ringbuffer management functions
  drm/i915: Final touches to LR contexts plumbing and refactoring
  drm/i915/bdw: Set the request context information correctly in the LRC case
  drm/i915/bdw: Prepare for user-created LR contexts
  drm/i915/bdw: Start creating & destroying user LR contexts
  drm/i915/bdw: Pin context pages at context create time
  drm/i915/bdw: Extract LR context object populating
  drm/i915/bdw: Introduce dependent contexts
  drm/i915/bdw: Create stand-alone and dependent contexts
  drm/i915/bdw: Allow non-default, non-render user LR contexts
  drm/i915/bdw: Fix reset stats ioctl with LR contexts
  drm/i915: Allocate an integer ID for each new file descriptor
  drm/i915/bdw: Prepare for a 20-bits globally unique submission ID
  drm/i915/bdw: Swap the PPGTT PDPs, LRC style
  drm/i915/bdw: Write the tail pointer, LRC style
  drm/i915/bdw: Display execlists info in debugfs
  drm/i915/bdw: Display context ringbuffer info in debugfs
  drm/i915/bdw: Start queueing contexts to be submitted
  drm/i915/bdw: Always write seqno to default context
  drm/i915/bdw: Enable logical ring contexts
  drm/i915/bdw: Document execlists and logical ring contexts

Thomas Daniel (3):
  drm/i915/bdw: Add forcewake lock around ELSP writes
  drm/i915/bdw: LR context switch interrupts
  drm/i915/bdw: Handle context switch events

 drivers/gpu/drm/i915/Makefile              |   1 +
 drivers/gpu/drm/i915/i915_cmd_parser.c     |  14 +-
 drivers/gpu/drm/i915/i915_debugfs.c        | 103 +++-
 drivers/gpu/drm/i915/i915_dma.c            |  57 +-
 drivers/gpu/drm/i915/i915_drv.h            |  90 +++-
 drivers/gpu/drm/i915/i915_gem.c            | 153 +++---
 drivers/gpu/drm/i915/i915_gem_context.c    | 109 ++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  85 +--
 drivers/gpu/drm/i915/i915_gem_gtt.c        |  39 +-
 drivers/gpu/drm/i915/i915_gem_gtt.h        |   2 +-
 drivers/gpu/drm/i915/i915_gpu_error.c      |  12 +-
 drivers/gpu/drm/i915/i915_irq.c            |  93 ++--
 drivers/gpu/drm/i915/i915_lrc.c            | 826 +++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h            |  10 +
 drivers/gpu/drm/i915/i915_trace.h          |  26 +-
 drivers/gpu/drm/i915/intel_display.c       |  26 +-
 drivers/gpu/drm/i915/intel_drv.h           |   4 +-
 drivers/gpu/drm/i915/intel_overlay.c       |  12 +-
 drivers/gpu/drm/i915/intel_pm.c            |  18 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    | 796 +++++++++++++++++---------
 drivers/gpu/drm/i915/intel_ringbuffer.h    | 187 ++++---
 drivers/gpu/drm/i915/intel_uncore.c        |  15 +
 22 files changed, 2043 insertions(+), 635 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_lrc.c

--
1.9.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx