Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> writes:
> From: Tomas Elf <tomas.elf@xxxxxxxxx>
>
> TDR = Timeout Detection and Recovery.
>
> This change introduces support for TDR-style per-engine reset as an initial,
> less intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. Initially we're only
> supporting gen8 but adding support for gen7 is straightforward since we've
> already established an extensible framework where gen7 support can be plugged
> in (add corresponding versions of intel_ring_enable, intel_ring_disable,
> intel_ring_save, intel_ring_restore, etc.).
>
> 1. Per-engine recovery vs. Full GPU recovery
>
> To capture the state of a single engine being detected as hung there is now a
> new flag for every engine that can be set once the decision has been made to
> schedule hang recovery for that particular engine. This patch only provides the
> hang recovery path but not the hang detection integration so for now there is
> no way of detecting individual engines as hung and targeting that individual
> engine for per-engine hang recovery.
>
> The following algorithm is used to determine when to use which recovery mode,
> given that hang detection has somehow detected a hang on an individual engine
> and given that per-engine hang recovery has been enabled (which it is not by
> default):
>
> 1. The error handler checks all engines that have been marked as hung
> by the hang checker and checks how long it has been since it last
> attempted per-engine hang recovery for each respective, currently
> hung engine. If the measured time period is within a certain time
> window, i.e. the last per-engine hang recovery was done too recently,
> it is determined that the previously attempted per-engine hang recovery
> was ineffective and the current hang is promoted to a
> full GPU reset. The default value for this time window is 10 seconds,
> meaning any hang happening within 10 seconds of a previous hang on the
> same engine will be promoted to full GPU reset. (Of course, as long as
> the per-engine hang recovery option is disabled this won't matter and
> the error handler will always go for legacy full GPU reset.)
>
> 2. If the error handler determines that no currently hung engine has
> recently had hang recovery, a per-engine hang recovery is scheduled.
>
> 3. If the decision to go with per-engine hang recovery is not taken, or
> if per-engine hang recovery is attempted but fails for whatever
> reason, TDR falls back to legacy full GPU recovery.
>
> NOTE: Gen7 and earlier will always promote to full GPU reset since there is
> currently no per-engine reset support for these gens.
>
> 2. Context Submission Status Consistency.
>
> Per-engine hang recovery on gen8 (or execlist submission mode in general)
> relies on the basic concept of context submission status consistency. What this
> means is that we make sure that the status of the hardware and the driver when
> it comes to the submission of the currently running context on any engine is
> consistent. For example, when submitting a context to the corresponding ELSP
> port of an engine we expect the owning request of that context to be at the
> head of the corresponding execution list queue. Likewise, as long as the
> context is executing on the GPU we expect the EXECLIST_STATUS register and the
> context status buffer (CSB) to reflect this.
> Thus, if the context submission
> status is consistent the ID of the currently executing context should be in
> EXECLIST_STATUS and it should be consistent with the context of the head
> request element in the execution list queue corresponding to that engine.
>
> The reason why this is important for per-engine hang recovery in execlist mode
> is that this recovery mode relies on context resubmission in order to resume
> execution following the recovery. If a context has been determined to be hung
> and the per-engine hang recovery mode is engaged, leading to the resubmission
> of that context, it's important that the hardware is not in fact busy doing
> something else or sitting idle, since a resubmission in that state could
> cause unforeseen side-effects such as unexpected preemptions.
>
> There are rare, although consistently reproducible, situations that have shown
> up in practice where the driver and hardware are no longer consistent with each
> other, e.g. due to lost context completion interrupts after which the hardware
> would be idle but the driver would still think that a context is
> active.
>
> 3. There is a new reset path for engine reset alongside the legacy full GPU
> reset path. This path does the following:
>
> 1) Check for context submission consistency to make sure that the
> context that the hardware is currently stuck on is actually what the
> driver is working on. If not then clearly we're not in a consistently
> hung state and we bail out early.
>
> 2) Disable/idle the engine. This is done through reset handshaking on
> gen8+, unlike earlier gens where this was done by clearing the ring
> valid bits in MI_MODE and ring control registers, which are no longer
> supported on gen8+. Reset handshaking translates to setting the reset
> request bit in the reset control register.
>
> 3) Save the current engine state. What this translates to on gen8 is
> simply to read the current value of the head register and nudge it so
> that it points to the next valid instruction in the ring buffer. Since
> we assume that the execution is currently stuck in a batch buffer the
> effect of this is that the batch buffer start instruction of the hung
> batch buffer is skipped so that when execution resumes, following the
> hang recovery completion, it resumes immediately following the batch
> buffer.
>
> This effectively means that we're forcefully terminating the currently
> active, hung batch buffer. Obviously, the outcome of this intervention
> is potentially undefined but there are not many good options in this
> scenario. It's better than resetting the entire GPU in the vast
> majority of cases.
>
> Save the nudged head value to be applied later.
>
> 4) Reset the engine.
>
> 5) Apply the nudged head value to the head register.
>
> 6) Reenable the engine. For gen8 this means resubmitting the fixed-up
> context, allowing execution to resume. In order to resubmit a context
> without relying on the currently hung execlist queue we use a new,
> privileged API that is dedicated to TDR use only. This submission API
> bypasses any currently queued work and gets exclusive access to the
> ELSP ports.
>
> 7) If the engine hang recovery procedure fails at any point between
> disablement and reenablement of the engine there is a back-off
> procedure: for gen8 it's possible to back out of the reset handshake by
> clearing the reset request bit in the reset control register.
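
Aside, to make the flow above easier to follow, here is a condensed sketch of
how section 1's promotion rule and steps 1-7 fit together. This is illustration
only: the helper names here are made up, locking, forcewake, uevents, the
head-nudge (force_advance) decision and most error unwinding are left out. The
actual implementation is i915_reset_engine() further down in the patch; the
building blocks used below are the ones this patch introduces.

/* Section 1: a repeat hang on the same engine within the promotion window
 * (i915.gpu_reset_promotion_time, default 10 s) escalates to full GPU reset. */
static bool tdr_promote_to_full_reset(struct intel_engine_cs *engine)
{
	u32 now = get_seconds();

	return (now - engine->hangcheck.last_engine_reset_time) <
		i915.gpu_reset_promotion_time;
}

/* Steps 1-7 of the per-engine recovery path, heavily condensed. */
static int tdr_recover_engine(struct intel_engine_cs *engine)
{
	struct drm_i915_gem_request *req = NULL;
	int ret;

	/* 1) Only proceed if HW and SW agree on the current context. */
	if (intel_execlists_TDR_get_current_request(engine, NULL) !=
	    CONTEXT_SUBMISSION_STATUS_OK)
		return -EAGAIN;

	/* 2) Idle the engine via the gen8 reset handshake. */
	ret = intel_ring_disable(engine);
	if (ret)
		return ret;

	/* Grab a reference to the hung request for the save/restore steps. */
	intel_execlists_TDR_get_current_request(engine, &req);

	/* 3) Save the head register, nudged past the hung BB_START. */
	ret = intel_ring_save(engine, req, false);
	if (ret)
		goto out;

	/* 4) Reset the engine. */
	ret = intel_gpu_engine_reset(engine);
	if (ret)
		goto out;

	/* 5) Restore the nudged head and resample the driver-side ring state. */
	ret = intel_ring_restore(engine, req);
	if (ret)
		goto out;
	intel_gpu_engine_reset_resample(engine, req);

	/* 6) Re-enable: resubmit the fixed-up context straight to ELSP. */
	intel_execlists_TDR_context_resubmission(engine);

out:
	/*
	 * 7) On failure between 2) and 6) the real code also backs out of the
	 *    reset handshake (intel_unrequest_gpu_engine_reset()); here we
	 *    just drop the request reference.
	 */
	if (req)
		i915_gem_request_unreference(req);
	return ret;
}
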
> > NOTE: > It's possible that some of Ben Widawsky's original per-engine reset patches > from 3 years ago are in this commit but since this work has gone through the > hands of at least 3 people already any kind of ownership tracking has been lost > a long time ago. If you think that you should be on the sob list just let me > know. > > * RFCv2: (Chris Wilson / Daniel Vetter) > - Simply use the previously private function i915_gem_reset_ring_status() from > the engine hang recovery path to set active/pending context status. This > replicates the same behaviour as in full GPU reset but for a single, > targetted engine. > > - Remove all additional uevents for both full GPU reset and per-engine reset. > Adapted uevent behaviour to the new per-engine hang recovery mode in that it > will only send one uevent regardless of which form of recovery is employed. > If a per-engine reset is attempted first then one uevent will be dispatched. > If that recovery mode fails and the hang is promoted to a full GPU reset no > further uevents will be dispatched at that point. > > - Tidied up the TDR context resubmission path in intel_lrc.c . Reduced the > amount of duplication by relying entirely on the normal unqueue function. > Added a new parameter to the unqueue function that takes into consideration > if the unqueue call is for a first-time context submission or a resubmission > and adapts the handling of elsp_submitted accordingly. The reason for > this is that for context resubmission we don't expect any further > interrupts for the submission or the following context completion. A more > elegant way of handling this would be to phase out elsp_submitted > altogether, however that's part of a LRC/execlist cleanup effort that is > happening independently of this patch series. For now we make this change > as simple as possible with as few non-TDR-related side-effects as > possible. > > Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx> > Signed-off-by: Ian Lister <ian.lister@xxxxxxxxx> > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > Signed-off-by: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> > --- > drivers/gpu/drm/i915/i915_dma.c | 18 + > drivers/gpu/drm/i915/i915_drv.c | 206 ++++++++++++ > drivers/gpu/drm/i915/i915_drv.h | 58 ++++ > drivers/gpu/drm/i915/i915_irq.c | 169 +++++++++- > drivers/gpu/drm/i915/i915_params.c | 19 ++ > drivers/gpu/drm/i915/i915_params.h | 2 + > drivers/gpu/drm/i915/i915_reg.h | 2 + > drivers/gpu/drm/i915/intel_lrc.c | 565 +++++++++++++++++++++++++++++++- > drivers/gpu/drm/i915/intel_lrc.h | 14 + > drivers/gpu/drm/i915/intel_lrc_tdr.h | 36 ++ > drivers/gpu/drm/i915/intel_ringbuffer.c | 84 ++++- > drivers/gpu/drm/i915/intel_ringbuffer.h | 64 ++++ > drivers/gpu/drm/i915/intel_uncore.c | 147 +++++++++ > 13 files changed, 1358 insertions(+), 26 deletions(-) > create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h > 1332 lines of new code in a single patch. We need to figure out how to split this. The context register write/read code and related macros are not needed anymore so that will lessen the lines alot. But some random comments for round two inlined below... > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c > index 44a896c..c45ec353 100644 > --- a/drivers/gpu/drm/i915/i915_dma.c > +++ b/drivers/gpu/drm/i915/i915_dma.c > @@ -837,6 +837,22 @@ static void intel_device_info_runtime_init(struct drm_device *dev) > info->has_eu_pg ? 
"y" : "n"); > } > > +static void > +i915_hangcheck_init(struct drm_device *dev) > +{ > + int i; > + struct drm_i915_private *dev_priv = dev->dev_private; > + > + for (i = 0; i < I915_NUM_RINGS; i++) { > + struct intel_engine_cs *engine = &dev_priv->ring[i]; > + struct intel_ring_hangcheck *hc = &engine->hangcheck; > + > + i915_hangcheck_reinit(engine); intel_engine_init_hangcheck(engine); > + hc->reset_count = 0; > + hc->tdr_count = 0; > + } > +} > + > static void intel_init_dpio(struct drm_i915_private *dev_priv) > { > /* > @@ -1034,6 +1050,8 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags) > > i915_gem_load(dev); > > + i915_hangcheck_init(dev); > + > /* On the 945G/GM, the chipset reports the MSI capability on the > * integrated graphics even though the support isn't actually there > * according to the published specs. It doesn't appear to function > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > index f17a2b0..c0ad003 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -34,6 +34,7 @@ > #include "i915_drv.h" > #include "i915_trace.h" > #include "intel_drv.h" > +#include "intel_lrc_tdr.h" We want to push pre gen 8 stuff also to here, atleast eventually. So #include "intel_tdr.h" > > #include <linux/console.h> > #include <linux/module.h> > @@ -571,6 +572,7 @@ static int i915_drm_suspend(struct drm_device *dev) > struct drm_i915_private *dev_priv = dev->dev_private; > pci_power_t opregion_target_state; > int error; > + int i; > > /* ignore lid events during suspend */ > mutex_lock(&dev_priv->modeset_restore_lock); > @@ -596,6 +598,16 @@ static int i915_drm_suspend(struct drm_device *dev) > > intel_guc_suspend(dev); > > + /* > + * Clear any pending reset requests. They should be picked up > + * after resume when new work is submitted > + */ > + for (i = 0; i < I915_NUM_RINGS; i++) > + atomic_set(&dev_priv->ring[i].hangcheck.flags, 0); This will cause havoc if you ever expand the flag space. If the comment says that you want to clear pending resets, then clear it with mask. > + > + atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG, > + &dev_priv->gpu_error.reset_counter); > + > intel_suspend_gt_powersave(dev); > > /* > @@ -948,6 +960,200 @@ int i915_reset(struct drm_device *dev) > return 0; > } > > +/** > + * i915_reset_engine - reset GPU engine after a hang > + * @engine: engine to reset > + * > + * Reset a specific GPU engine. Useful if a hang is detected. Returns zero on successful > + * reset or otherwise an error code. > + * > + * Procedure is fairly simple: > + * > + * - Force engine to idle. > + * > + * - Save current head register value and nudge it past the point of the hang in the > + * ring buffer, which is typically the BB_START instruction of the hung batch buffer, > + * on to the following instruction. > + * > + * - Reset engine. > + * > + * - Restore the previously saved, nudged head register value. > + * > + * - Re-enable engine to resume running. On gen8 this requires the previously hung > + * context to be resubmitted to ELSP via the dedicated TDR-execlists interface. 
> + * > + */ > +int i915_reset_engine(struct intel_engine_cs *engine) > +{ > + struct drm_device *dev = engine->dev; > + struct drm_i915_private *dev_priv = dev->dev_private; > + struct drm_i915_gem_request *current_request = NULL; > + uint32_t head; > + bool force_advance = false; > + int ret = 0; > + int err_ret = 0; > + > + WARN_ON(!mutex_is_locked(&dev->struct_mutex)); > + > + /* Take wake lock to prevent power saving mode */ > + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL); > + > + i915_gem_reset_ring_status(dev_priv, engine); > > + if (i915.enable_execlists) { > + enum context_submission_status status = > + intel_execlists_TDR_get_current_request(engine, NULL); > + > + /* > + * If the context submission state in hardware is not > + * consistent with the the corresponding state in the driver or > + * if there for some reason is no current context in the > + * process of being submitted then bail out and try again. Do > + * not proceed unless we have reliable current context state > + * information. The reason why this is important is because > + * per-engine hang recovery relies on context resubmission in > + * order to force the execution to resume following the hung > + * batch buffer. If the hardware is not currently running the > + * same context as the driver thinks is hung then anything can > + * happen at the point of context resubmission, e.g. unexpected > + * preemptions or the previously hung context could be > + * submitted when the hardware is idle which makes no sense. > + */ > + if (status != CONTEXT_SUBMISSION_STATUS_OK) { > + ret = -EAGAIN; > + goto reset_engine_error; > + } > + } This whole ambivalence troubles me. If our hangcheck part is lacking so that it will reset engines that really are not stuck, then we should move/improve this logic in hangcheck side. We are juggling here with the the execlist lock inside the intel_execlist_TDR_get_current_request and on multiple calls to that. We need to hold the execlist lock during the state save and restore. > + > + ret = intel_ring_disable(engine); > + if (ret != 0) { > + DRM_ERROR("Failed to disable %s\n", engine->name); > + goto reset_engine_error; > + } > + > + if (i915.enable_execlists) { > + enum context_submission_status status; > + bool inconsistent; > + > + status = intel_execlists_TDR_get_current_request(engine, > + ¤t_request); > + intel_execlist_get_current_request() intel_execlist_get_submission_status() if we have lock, no need to do everything in same function. And move the referencing of current_request up to this context as the unreferencing is already here. > + inconsistent = (status != CONTEXT_SUBMISSION_STATUS_OK); > + if (inconsistent) { > + /* > + * If we somehow have reached this point with > + * an inconsistent context submission status then > + * back out of the previously requested reset and > + * retry later. > + */ > + WARN(inconsistent, > + "Inconsistent context status on %s: %u\n", > + engine->name, status); > + > + ret = -EAGAIN; > + goto reenable_reset_engine_error; > + } > + } > + > + /* Sample the current ring head position */ > + head = I915_READ_HEAD(engine) & HEAD_ADDR; intel_ring_get_active_head(engine); > + > + if (head == engine->hangcheck.last_head) { > + /* > + * The engine has not advanced since the last > + * time it hung so force it to advance to the > + * next QWORD. In most cases the engine head > + * pointer will automatically advance to the > + * next instruction as soon as it has read the > + * current instruction, without waiting for it > + * to complete. 
This seems to be the default > + * behaviour, however an MBOX wait inserted > + * directly to the VCS/BCS engines does not behave > + * in the same way, instead the head pointer > + * will still be pointing at the MBOX instruction > + * until it completes. > + */ > + force_advance = true; > + } > + > + engine->hangcheck.last_head = head; > + > + ret = intel_ring_save(engine, current_request, force_advance); intel_engine_save() > + if (ret) { > + DRM_ERROR("Failed to save %s engine state\n", engine->name); > + goto reenable_reset_engine_error; > + } > + > + ret = intel_gpu_engine_reset(engine); intel_engine_reset() > + if (ret) { > + DRM_ERROR("Failed to reset %s\n", engine->name); > + goto reenable_reset_engine_error; > + } > + > + ret = intel_ring_restore(engine, current_request); intel_engine_restore() > + if (ret) { > + DRM_ERROR("Failed to restore %s engine state\n", engine->name); > + goto reenable_reset_engine_error; > + } > + > + /* Correct driver state */ > + intel_gpu_engine_reset_resample(engine, current_request); This looks like it resamples the head. intel_engine_reset_head() > + > + /* > + * Reenable engine > + * > + * In execlist mode on gen8+ this is implicit by simply resubmitting > + * the previously hung context. In ring buffer submission mode on gen7 > + * and earlier we need to actively turn on the engine first. > + */ > + if (i915.enable_execlists) > + intel_execlists_TDR_context_resubmission(engine); intel_logical_ring_enable()? > + else > + ret = intel_ring_enable(engine); > + > + if (ret) { > + DRM_ERROR("Failed to enable %s again after reset\n", > + engine->name); > + > + goto reset_engine_error; > + } > + > + /* Clear reset flags to allow future hangchecks */ > + atomic_set(&engine->hangcheck.flags, 0); > + > + /* Wake up anything waiting on this engine's queue */ > + wake_up_all(&engine->irq_queue); > + > + if (i915.enable_execlists && current_request) > + i915_gem_request_unreference(current_request); > + > + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); > + reset_engine_error: is identical to code block above. > + return ret; > + > +reenable_reset_engine_error: > + > + err_ret = intel_ring_enable(engine); > + if (err_ret) > + DRM_ERROR("Failed to reenable %s following error during reset (%d)\n", > + engine->name, err_ret); > + > +reset_engine_error: > + > + /* Clear reset flags to allow future hangchecks */ > + atomic_set(&engine->hangcheck.flags, 0); > + > + /* Wake up anything waiting on this engine's queue */ > + wake_up_all(&engine->irq_queue); > + > + if (i915.enable_execlists && current_request) > + i915_gem_request_unreference(current_request); > + > + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); > + > + return ret; > +} > + > static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) > { > struct intel_device_info *intel_info = > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > index 703a320..e866f14 100644 > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -2432,6 +2432,48 @@ struct drm_i915_cmd_table { > int count; > }; > > +/* > + * Context submission status > + * > + * CONTEXT_SUBMISSION_STATUS_OK: > + * Context submitted to ELSP and state of execlist queue is the same as > + * the state of EXECLIST_STATUS register. Software and hardware states > + * are consistent and can be trusted. 
> + * > + * CONTEXT_SUBMISSION_STATUS_INCONSISTENT: > + * Context has been submitted to the execlist queue but the state of the > + * EXECLIST_STATUS register is different from the execlist queue state. > + * This could mean any of the following: > + * > + * 1. The context is in the head position of the execlist queue > + * but has not yet been submitted to ELSP. > + * > + * 2. The hardware just recently completed the context but the > + * context is pending removal from the execlist queue. > + * > + * 3. The driver has lost a context state transition interrupt. > + * Typically what this means is that hardware has completed and > + * is now idle but the driver thinks the hardware is still > + * busy. > + * > + * Overall what this means is that the context submission status is > + * currently in transition and cannot be trusted until it settles down. > + * > + * CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED: > + * No context submitted to the execlist queue and the EXECLIST_STATUS > + * register shows no context being processed. > + * > + * CONTEXT_SUBMISSION_STATUS_NONE_UNDEFINED: > + * Initial state before submission status has been determined. > + * > + */ > +enum context_submission_status { > + CONTEXT_SUBMISSION_STATUS_OK = 0, > + CONTEXT_SUBMISSION_STATUS_INCONSISTENT, > + CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED, > + CONTEXT_SUBMISSION_STATUS_UNDEFINED > +}; > + > /* Note that the (struct drm_i915_private *) cast is just to shut up gcc. */ > #define __I915__(p) ({ \ > struct drm_i915_private *__p; \ > @@ -2690,8 +2732,12 @@ extern long i915_compat_ioctl(struct file *filp, unsigned int cmd, > unsigned long arg); > #endif > extern int intel_gpu_reset(struct drm_device *dev); > +extern int intel_gpu_engine_reset(struct intel_engine_cs *engine); > +extern int intel_request_gpu_engine_reset(struct intel_engine_cs *engine); > +extern int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine); > extern bool intel_has_gpu_reset(struct drm_device *dev); > extern int i915_reset(struct drm_device *dev); > +extern int i915_reset_engine(struct intel_engine_cs *engine); > extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv); > extern unsigned long i915_mch_val(struct drm_i915_private *dev_priv); > extern unsigned long i915_gfx_val(struct drm_i915_private *dev_priv); > @@ -2704,6 +2750,18 @@ void intel_hpd_init(struct drm_i915_private *dev_priv); > void intel_hpd_init_work(struct drm_i915_private *dev_priv); > void intel_hpd_cancel_work(struct drm_i915_private *dev_priv); > bool intel_hpd_pin_to_port(enum hpd_pin pin, enum port *port); > +static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine) > +{ > + struct intel_ring_hangcheck *hc = &engine->hangcheck; > + > + hc->acthd = 0; > + hc->max_acthd = 0; > + hc->seqno = 0; > + hc->score = 0; > + hc->action = HANGCHECK_IDLE; > + hc->deadlock = 0; > +} > + Rename to intel_engine_hangcheck_init and to intel_ringbuffer.c > > /* i915_irq.c */ > void i915_queue_hangcheck(struct drm_device *dev); > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index f04d799..6a0ec37 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -2470,10 +2470,70 @@ static void i915_reset_and_wakeup(struct drm_device *dev) > char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; > char *reset_event[] = { I915_RESET_UEVENT "=1", NULL }; > char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL }; > - int ret; > + bool reset_complete = false; > + struct intel_engine_cs 
*ring; > + int ret = 0; > + int i; > + > + mutex_lock(&dev->struct_mutex); > > kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event); > > + for_each_ring(ring, dev_priv, i) { > + > + /* > + * Skip further individual engine reset requests if full GPU > + * reset requested. > + */ > + if (i915_reset_in_progress(error)) > + break; > + > + if (atomic_read(&ring->hangcheck.flags) & > + I915_ENGINE_RESET_IN_PROGRESS) { > + > + if (!reset_complete) > + kobject_uevent_env(&dev->primary->kdev->kobj, > + KOBJ_CHANGE, > + reset_event); > + > + reset_complete = true; > + > + ret = i915_reset_engine(ring); > + > + /* > + * Execlist mode only: > + * > + * -EAGAIN means that between detecting a hang (and > + * also determining that the currently submitted > + * context is stable and valid) and trying to recover > + * from the hang the current context changed state. > + * This means that we are probably not completely hung > + * after all. Just fail and retry by exiting all the > + * way back and wait for the next hang detection. If we > + * have a true hang on our hands then we will detect it > + * again, otherwise we will continue like nothing > + * happened. > + */ > + if (ret == -EAGAIN) { > + DRM_ERROR("Reset of %s aborted due to " \ > + "change in context submission " \ > + "state - retrying!", ring->name); > + ret = 0; > + } > + > + if (ret) { > + DRM_ERROR("Reset of %s failed! (%d)", ring->name, ret); > + > + atomic_or(I915_RESET_IN_PROGRESS_FLAG, > + &dev_priv->gpu_error.reset_counter); > + break; > + } > + } > + } > + > + /* The full GPU reset will grab the struct_mutex when it needs it */ > + mutex_unlock(&dev->struct_mutex); > + > /* > * Note that there's only one work item which does gpu resets, so we > * need not worry about concurrent gpu resets potentially incrementing > @@ -2486,8 +2546,13 @@ static void i915_reset_and_wakeup(struct drm_device *dev) > */ > if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) { > DRM_DEBUG_DRIVER("resetting chip\n"); > - kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, > - reset_event); > + > + if (!reset_complete) > + kobject_uevent_env(&dev->primary->kdev->kobj, > + KOBJ_CHANGE, > + reset_event); > + > + reset_complete = true; > > /* > * In most cases it's guaranteed that we get here with an RPM > @@ -2520,23 +2585,36 @@ static void i915_reset_and_wakeup(struct drm_device *dev) > * > * Since unlock operations are a one-sided barrier only, > * we need to insert a barrier here to order any seqno > - * updates before > - * the counter increment. > + * updates before the counter increment. > + * > + * The increment clears I915_RESET_IN_PROGRESS_FLAG. > */ > smp_mb__before_atomic(); > atomic_inc(&dev_priv->gpu_error.reset_counter); > > - kobject_uevent_env(&dev->primary->kdev->kobj, > - KOBJ_CHANGE, reset_done_event); > + /* > + * If any per-engine resets were promoted to full GPU > + * reset don't forget to clear those reset flags. > + */ > + for_each_ring(ring, dev_priv, i) > + atomic_set(&ring->hangcheck.flags, 0); > } else { > + /* Terminal wedge condition */ > + WARN(1, "i915_reset failed, declaring GPU as wedged!\n"); > atomic_or(I915_WEDGED, &error->reset_counter); > } > + } > > - /* > - * Note: The wake_up also serves as a memory barrier so that > - * waiters see the update value of the reset counter atomic_t. > - */ > + /* > + * Note: The wake_up also serves as a memory barrier so that > + * waiters see the update value of the reset counter atomic_t. 
> + */ > + if (reset_complete) { > i915_error_wake_up(dev_priv, true); > + > + if (ret == 0) > + kobject_uevent_env(&dev->primary->kdev->kobj, > + KOBJ_CHANGE, reset_done_event); > } > } > > @@ -2649,6 +2727,14 @@ void i915_handle_error(struct drm_device *dev, bool wedged, > va_list args; > char error_msg[80]; > > + struct intel_engine_cs *engine; > + > + /* > + * NB: Placeholder until the hang checker supports > + * per-engine hang detection. > + */ > + u32 engine_mask = 0; > + > va_start(args, fmt); > vscnprintf(error_msg, sizeof(error_msg), fmt, args); > va_end(args); > @@ -2657,8 +2743,65 @@ void i915_handle_error(struct drm_device *dev, bool wedged, > i915_report_and_clear_eir(dev); > > if (wedged) { > - atomic_or(I915_RESET_IN_PROGRESS_FLAG, > - &dev_priv->gpu_error.reset_counter); > + /* > + * Defer to full GPU reset if any of the following is true: > + * 0. Engine reset disabled. > + * 1. The caller did not ask for per-engine reset. > + * 2. The hardware does not support it (pre-gen7). > + * 3. We already tried per-engine reset recently. > + */ > + bool full_reset = true; > + > + if (!i915.enable_engine_reset) { > + DRM_INFO("Engine reset disabled: Using full GPU reset.\n"); > + engine_mask = 0x0; > + } > + > + /* > + * TBD: We currently only support per-engine reset for gen8+. > + * Implement support for gen7. > + */ > + if (engine_mask && (INTEL_INFO(dev)->gen >= 8)) { > + u32 i; > + > + for_each_ring(engine, dev_priv, i) { > + u32 now, last_engine_reset_timediff; > + > + if (!(intel_ring_flag(engine) & engine_mask)) > + continue; > + > + /* Measure the time since this engine was last reset */ > + now = get_seconds(); > + last_engine_reset_timediff = > + now - engine->hangcheck.last_engine_reset_time; > + > + full_reset = last_engine_reset_timediff < > + i915.gpu_reset_promotion_time; > + > + engine->hangcheck.last_engine_reset_time = now; > + > + /* > + * This engine was not reset too recently - go ahead > + * with engine reset instead of falling back to full > + * GPU reset. > + * > + * Flag that we want to try and reset this engine. > + * This can still be overridden by a global > + * reset e.g. if per-engine reset fails. > + */ > + if (!full_reset) > + atomic_or(I915_ENGINE_RESET_IN_PROGRESS, > + &engine->hangcheck.flags); > + else > + break; > + > + } /* for_each_ring */ > + } > + > + if (full_reset) { > + atomic_or(I915_RESET_IN_PROGRESS_FLAG, > + &dev_priv->gpu_error.reset_counter); > + } > > /* > * Wakeup waiting processes so that the reset function > diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c > index 8d90c25..5cf9c11 100644 > --- a/drivers/gpu/drm/i915/i915_params.c > +++ b/drivers/gpu/drm/i915/i915_params.c > @@ -37,6 +37,8 @@ struct i915_params i915 __read_mostly = { > .enable_fbc = -1, > .enable_execlists = -1, > .enable_hangcheck = true, > + .enable_engine_reset = false, > + .gpu_reset_promotion_time = 10, > .enable_ppgtt = -1, > .enable_psr = 0, > .preliminary_hw_support = IS_ENABLED(CONFIG_DRM_I915_PRELIMINARY_HW_SUPPORT), > @@ -116,6 +118,23 @@ MODULE_PARM_DESC(enable_hangcheck, > "WARNING: Disabling this can cause system wide hangs. " > "(default: true)"); > > +module_param_named_unsafe(enable_engine_reset, i915.enable_engine_reset, bool, 0644); > +MODULE_PARM_DESC(enable_engine_reset, > + "Enable GPU engine hang recovery mode. 
Used as a soft, low-impact form " > + "of hang recovery that targets individual GPU engines rather than the " > + "entire GPU" > + "(default: false)"); > + > +module_param_named(gpu_reset_promotion_time, > + i915.gpu_reset_promotion_time, int, 0644); > +MODULE_PARM_DESC(gpu_reset_promotion_time, > + "Catch excessive engine resets. Each engine maintains a " > + "timestamp of the last time it was reset. If it hangs again " > + "within this period then fall back to full GPU reset to try and" > + " recover from the hang. Only applicable if enable_engine_reset " > + "is enabled." > + "default=10 seconds"); > + > module_param_named_unsafe(enable_ppgtt, i915.enable_ppgtt, int, 0400); > MODULE_PARM_DESC(enable_ppgtt, > "Override PPGTT usage. " > diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h > index 5299290..60f3d23 100644 > --- a/drivers/gpu/drm/i915/i915_params.h > +++ b/drivers/gpu/drm/i915/i915_params.h > @@ -49,8 +49,10 @@ struct i915_params { > int use_mmio_flip; > int mmio_debug; > int edp_vswing; > + unsigned int gpu_reset_promotion_time; > /* leave bools at the end to not create holes */ > bool enable_hangcheck; > + bool enable_engine_reset; > bool fastboot; > bool prefault_disable; > bool load_detect_test; > diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h > index 0a98889..3fc5d75 100644 > --- a/drivers/gpu/drm/i915/i915_reg.h > +++ b/drivers/gpu/drm/i915/i915_reg.h > @@ -164,6 +164,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg) > #define GEN6_GRDOM_RENDER (1 << 1) > #define GEN6_GRDOM_MEDIA (1 << 2) > #define GEN6_GRDOM_BLT (1 << 3) > +#define GEN6_GRDOM_VECS (1 << 4) > +#define GEN8_GRDOM_MEDIA2 (1 << 7) > > #define RING_PP_DIR_BASE(ring) _MMIO((ring)->mmio_base+0x228) > #define RING_PP_DIR_BASE_READ(ring) _MMIO((ring)->mmio_base+0x518) > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > index ab344e0..fcec476 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.c > +++ b/drivers/gpu/drm/i915/intel_lrc.c > @@ -136,6 +136,7 @@ > #include <drm/i915_drm.h> > #include "i915_drv.h" > #include "intel_mocs.h" > +#include "intel_lrc_tdr.h" > > #define GEN9_LR_CONTEXT_RENDER_SIZE (22 * PAGE_SIZE) > #define GEN8_LR_CONTEXT_RENDER_SIZE (20 * PAGE_SIZE) > @@ -325,7 +326,8 @@ uint64_t intel_lr_context_descriptor(struct intel_context *ctx, > } > > static void execlists_elsp_write(struct drm_i915_gem_request *rq0, > - struct drm_i915_gem_request *rq1) > + struct drm_i915_gem_request *rq1, > + bool tdr_resubmission) > { > > struct intel_engine_cs *ring = rq0->ring; > @@ -335,13 +337,17 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0, > > if (rq1) { > desc[1] = intel_lr_context_descriptor(rq1->ctx, rq1->ring); > - rq1->elsp_submitted++; > + > + if (!tdr_resubmission) > + rq1->elsp_submitted++; > } else { > desc[1] = 0; > } > > desc[0] = intel_lr_context_descriptor(rq0->ctx, rq0->ring); > - rq0->elsp_submitted++; > + > + if (!tdr_resubmission) > + rq0->elsp_submitted++; > > /* You must always write both descriptors in the order below. */ > spin_lock(&dev_priv->uncore.lock); > @@ -359,6 +365,182 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0, > spin_unlock(&dev_priv->uncore.lock); > } > > +/** > + * execlist_get_context_reg_page() - Get memory page for context object > + * @engine: engine > + * @ctx: context running on engine > + * @page: returned page > + * > + * Return: 0 if successful, otherwise propagates error codes. 
> + */ > +static inline int execlist_get_context_reg_page(struct intel_engine_cs *engine, > + struct intel_context *ctx, > + struct page **page) > +{ All the macros and reg_page stuff can be removed as there is ctx->engine[id].lrc_reg_state for pinned ctx objects. > + struct drm_i915_gem_object *ctx_obj; > + > + if (!page) > + return -EINVAL; > + > + if (!ctx) > + ctx = engine->default_context; > + No. Add a warn which triggers if someone tries to touch the default_context through this mechanism. Default should be sacred, we don't want any state to accidentally creep into it. > + ctx_obj = ctx->engine[engine->id].state; > + > + if (WARN(!ctx_obj, "Context object not set up!\n")) > + return -EINVAL; > + > + WARN(!i915_gem_obj_is_pinned(ctx_obj), > + "Context object is not pinned!\n"); > + > + *page = i915_gem_object_get_page(ctx_obj, LRC_STATE_PN); > + > + if (WARN(!*page, "Context object page could not be resolved!\n")) > + return -EINVAL; > + > + return 0; > +} > + > +/** > + * execlist_write_context_reg() - Write value to Context register > + * @engine: Engine > + * @ctx: Context running on engine > + * @ctx_reg: Index into context image pointing to register location > + * @mmio_reg_addr: MMIO register address > + * @val: Value to be written > + * @mmio_reg_name_str: Designated register name > + * > + * Return: 0 if successful, otherwise propagates error codes. > + */ > +static inline int execlists_write_context_reg(struct intel_engine_cs *engine, > + struct intel_context *ctx, > + u32 ctx_reg, > + i915_reg_t mmio_reg, > + u32 val, > + const char *mmio_reg_name_str) > +{ > + struct page *page = NULL; > + uint32_t *reg_state; > + > + int ret = execlist_get_context_reg_page(engine, ctx, &page); > + if (WARN(ret, "[write %s:%u] Failed to get context memory page for %s!\n", > + mmio_reg_name_str, (unsigned int) mmio_reg.reg, engine->name)) { > + return ret; > + } > + > + reg_state = kmap_atomic(page); > + > + WARN(reg_state[ctx_reg] != mmio_reg.reg, > + "[write %s:%u]: Context reg addr (%x) != MMIO reg addr (%x)!\n", > + mmio_reg_name_str, > + (unsigned int) mmio_reg.reg, > + (unsigned int) reg_state[ctx_reg], > + (unsigned int) mmio_reg.reg); > + > + reg_state[ctx_reg+1] = val; > + kunmap_atomic(reg_state); > + > + return ret; > +} > + > +/** > + * execlist_read_context_reg() - Read value from Context register > + * @engine: Engine > + * @ctx: Context running on engine > + * @ctx_reg: Index into context image pointing to register location > + * @mmio_reg: MMIO register struct > + * @val: Output parameter returning register value > + * @mmio_reg_name_str: Designated register name > + * > + * Return: 0 if successful, otherwise propagates error codes. 
> + */ > +static inline int execlists_read_context_reg(struct intel_engine_cs *engine, > + struct intel_context *ctx, > + u32 ctx_reg, > + i915_reg_t mmio_reg, > + u32 *val, > + const char *mmio_reg_name_str) > +{ > + struct page *page = NULL; > + uint32_t *reg_state; > + int ret = 0; > + > + if (!val) > + return -EINVAL; > + > + ret = execlist_get_context_reg_page(engine, ctx, &page); > + if (WARN(ret, "[read %s:%u] Failed to get context memory page for %s!\n", > + mmio_reg_name_str, (unsigned int) mmio_reg.reg, engine->name)) { > + return ret; > + } > + > + reg_state = kmap_atomic(page); > + > + WARN(reg_state[ctx_reg] != mmio_reg.reg, > + "[read %s:%u]: Context reg addr (%x) != MMIO reg addr (%x)!\n", > + mmio_reg_name_str, > + (unsigned int) ctx_reg, > + (unsigned int) reg_state[ctx_reg], > + (unsigned int) mmio_reg.reg); > + > + *val = reg_state[ctx_reg+1]; > + kunmap_atomic(reg_state); > + > + return ret; > + } > + > +/* > + * Generic macros for generating function implementation for context register > + * read/write functions. > + * > + * Macro parameters > + * ---------------- > + * reg_name: Designated name of context register (e.g. tail, head, buffer_ctl) > + * > + * reg_def: Context register macro definition (e.g. CTX_RING_TAIL) > + * > + * mmio_reg_def: Name of macro function used to determine the address > + * of the corresponding MMIO register (e.g. RING_TAIL, RING_HEAD). > + * This macro function is assumed to be defined on the form of: > + * > + * #define mmio_reg_def(base) (base+register_offset) > + * > + * Where "base" is the MMIO base address of the respective ring > + * and "register_offset" is the offset relative to "base". > + * > + * Function parameters > + * ------------------- > + * engine: The engine that the context is running on > + * ctx: The context of the register that is to be accessed > + * reg_name: Value to be written/read to/from the register. 
> + */ > +#define INTEL_EXECLISTS_WRITE_REG(reg_name, reg_def, mmio_reg_def) \ > + int intel_execlists_write_##reg_name(struct intel_engine_cs *engine, \ > + struct intel_context *ctx, \ > + u32 reg_name) \ > +{ \ > + return execlists_write_context_reg(engine, ctx, (reg_def), \ > + mmio_reg_def(engine->mmio_base), (reg_name), \ > + (#reg_name)); \ > +} > + > +#define INTEL_EXECLISTS_READ_REG(reg_name, reg_def, mmio_reg_def) \ > + int intel_execlists_read_##reg_name(struct intel_engine_cs *engine, \ > + struct intel_context *ctx, \ > + u32 *reg_name) \ > +{ \ > + return execlists_read_context_reg(engine, ctx, (reg_def), \ > + mmio_reg_def(engine->mmio_base), (reg_name), \ > + (#reg_name)); \ > +} > + > +INTEL_EXECLISTS_READ_REG(tail, CTX_RING_TAIL, RING_TAIL) > +INTEL_EXECLISTS_WRITE_REG(head, CTX_RING_HEAD, RING_HEAD) > +INTEL_EXECLISTS_READ_REG(head, CTX_RING_HEAD, RING_HEAD) > + > +#undef INTEL_EXECLISTS_READ_REG > +#undef INTEL_EXECLISTS_WRITE_REG > + > static int execlists_update_context(struct drm_i915_gem_request *rq) > { > struct intel_engine_cs *ring = rq->ring; > @@ -396,17 +578,18 @@ static int execlists_update_context(struct drm_i915_gem_request *rq) > } > > static void execlists_submit_requests(struct drm_i915_gem_request *rq0, > - struct drm_i915_gem_request *rq1) > + struct drm_i915_gem_request *rq1, > + bool tdr_resubmission) > { > execlists_update_context(rq0); > > if (rq1) > execlists_update_context(rq1); > > - execlists_elsp_write(rq0, rq1); > + execlists_elsp_write(rq0, rq1, tdr_resubmission); > } > > -static void execlists_context_unqueue(struct intel_engine_cs *ring) > +static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_resubmission) > { > struct drm_i915_gem_request *req0 = NULL, *req1 = NULL; > struct drm_i915_gem_request *cursor = NULL, *tmp = NULL; > @@ -440,6 +623,16 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring) > } > } > > + /* > + * Only do TDR resubmission of the second head request if it's already > + * been submitted. The intention is to restore the original submission > + * state from the situation when the hang originally happened. If it > + * was never submitted we don't want to submit it for the first time at > + * this point > + */ > + if (tdr_resubmission && req1 && !req1->elsp_submitted) > + req1 = NULL; > + > if (IS_GEN8(ring->dev) || IS_GEN9(ring->dev)) { > /* > * WaIdleLiteRestore: make sure we never cause a lite > @@ -460,9 +653,32 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring) > } > } > > - WARN_ON(req1 && req1->elsp_submitted); > + WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission); > > - execlists_submit_requests(req0, req1); > + execlists_submit_requests(req0, req1, tdr_resubmission); > +} > + > +/** > + * intel_execlists_TDR_context_resubmission() - ELSP context resubmission > + * @ring: engine to do resubmission for. > + * > + * Context submission mechanism exclusively used by TDR that bypasses the > + * execlist queue. This is necessary since at the point of TDR hang recovery > + * the hardware will be hung and resubmitting a fixed context (the context that > + * the TDR has identified as hung and fixed up in order to move past the > + * blocking batch buffer) to a hung execlist queue will lock up the TDR. > + * Instead, opt for direct ELSP submission without depending on the rest of the > + * driver. 
> + */ > +void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&ring->execlist_lock, flags); > + WARN_ON(list_empty(&ring->execlist_queue)); > + > + execlists_context_unqueue(ring, true); > + spin_unlock_irqrestore(&ring->execlist_lock, flags); > } > > static bool execlists_check_remove_request(struct intel_engine_cs *ring, > @@ -560,9 +776,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring) > /* Prevent a ctx to preempt itself */ > if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) && > (submit_contexts != 0)) > - execlists_context_unqueue(ring); > + execlists_context_unqueue(ring, false); > } else if (submit_contexts != 0) { > - execlists_context_unqueue(ring); > + execlists_context_unqueue(ring, false); > } > > spin_unlock(&ring->execlist_lock); > @@ -613,7 +829,7 @@ static int execlists_context_queue(struct drm_i915_gem_request *request) > > list_add_tail(&request->execlist_link, &ring->execlist_queue); > if (num_elements == 0) > - execlists_context_unqueue(ring); > + execlists_context_unqueue(ring, false); > > spin_unlock_irq(&ring->execlist_lock); > > @@ -1536,7 +1752,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring) > ring->next_context_status_buffer = next_context_status_buffer_hw; > DRM_DEBUG_DRIVER("Execlists enabled for %s\n", ring->name); > > - memset(&ring->hangcheck, 0, sizeof(ring->hangcheck)); > + i915_hangcheck_reinit(ring); > > return 0; > } > @@ -1888,6 +2104,187 @@ out: > return ret; > } > > +static int > +gen8_ring_disable(struct intel_engine_cs *ring) > +{ > + intel_request_gpu_engine_reset(ring); > + return 0; > +} > + > +static int > +gen8_ring_enable(struct intel_engine_cs *ring) > +{ > + intel_unrequest_gpu_engine_reset(ring); > + return 0; > +} > + > +/** > + * gen8_ring_save() - save minimum engine state > + * @ring: engine whose state is to be saved > + * @req: request containing the context currently running on engine > + * @force_advance: indicates whether or not we should nudge the head > + * forward or not > + * > + * Saves the head MMIO register to scratch memory while engine is reset and > + * reinitialized. Before saving the head register we nudge the head position to > + * be correctly aligned with a QWORD boundary, which brings it up to the next > + * presumably valid instruction. Typically, at the point of hang recovery the > + * head register will be pointing to the last DWORD of the BB_START > + * instruction, which is followed by a padding MI_NOOP inserted by the > + * driver. > + * > + * Returns: > + * 0 if ok, otherwise propagates error codes. > + */ > +static int > +gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req, > + bool force_advance) > +{ > + struct drm_i915_private *dev_priv = ring->dev->dev_private; > + struct intel_ringbuffer *ringbuf = NULL; > + struct intel_context *ctx; > + int ret = 0; > + int clamp_to_tail = 0; > + uint32_t head; > + uint32_t tail; > + uint32_t head_addr; > + uint32_t tail_addr; > + > + if (WARN_ON(!req)) > + return -EINVAL; > + > + ctx = req->ctx; > + ringbuf = ctx->engine[ring->id].ringbuf; > + > + /* > + * Read head from MMIO register since it contains the > + * most up to date value of head at this point. > + */ > + head = I915_READ_HEAD(ring); > + > + /* > + * Read tail from the context because the execlist queue > + * updates the tail value there first during submission. > + * The MMIO tail register is not updated until the actual > + * ring submission completes. 
> + */ > + ret = I915_READ_TAIL_CTX(ring, ctx, tail); > + if (ret) > + return ret; > + > + /* > + * head_addr and tail_addr are the head and tail values > + * excluding ring wrapping information and aligned to DWORD > + * boundary > + */ > + head_addr = head & HEAD_ADDR; > + tail_addr = tail & TAIL_ADDR; > + > + /* > + * The head must always chase the tail. > + * If the tail is beyond the head then do not allow > + * the head to overtake it. If the tail is less than > + * the head then the tail has already wrapped and > + * there is no problem in advancing the head or even > + * wrapping the head back to 0 as worst case it will > + * become equal to tail > + */ > + if (head_addr <= tail_addr) > + clamp_to_tail = 1; > + > + if (force_advance) { > + > + /* Force head pointer to next QWORD boundary */ > + head_addr &= ~0x7; > + head_addr += 8; > + > + } else if (head & 0x7) { > + > + /* Ensure head pointer is pointing to a QWORD boundary */ > + head += 0x7; > + head &= ~0x7; > + head_addr = head; > + } > + > + if (clamp_to_tail && (head_addr > tail_addr)) { > + head_addr = tail_addr; > + } else if (head_addr >= ringbuf->size) { > + /* Wrap head back to start if it exceeds ring size */ > + head_addr = 0; > + } > + > + head &= ~HEAD_ADDR; > + head |= (head_addr & HEAD_ADDR); > + ring->saved_head = head; > + > + return 0; > +} > + > + > +/** > + * gen8_ring_restore() - restore previously saved engine state > + * @ring: engine whose state is to be restored > + * @req: request containing the context currently running on engine > + * > + * Reinitializes engine and restores the previously saved engine state. > + * See: gen8_ring_save() > + * > + * Returns: > + * 0 if ok, otherwise propagates error codes. > + */ > +static int > +gen8_ring_restore(struct intel_engine_cs *ring, struct drm_i915_gem_request *req) > +{ > + struct drm_i915_private *dev_priv = ring->dev->dev_private; > + struct intel_context *ctx; > + > + if (WARN_ON(!req)) > + return -EINVAL; > + > + ctx = req->ctx; > + > + /* Re-initialize ring */ > + if (ring->init_hw) { > + int ret = ring->init_hw(ring); > + if (ret != 0) { > + DRM_ERROR("Failed to re-initialize %s\n", > + ring->name); > + return ret; > + } > + } else { > + DRM_ERROR("ring init function pointer not set up\n"); > + return -EINVAL; > + } > + > + if (ring->id == RCS) { > + /* > + * These register reinitializations are only located here > + * temporarily until they are moved out of the > + * init_clock_gating function to some function we can > + * call from here. 
> + */ > + > + /* WaVSRefCountFullforceMissDisable:chv */ > + /* WaDSRefCountFullforceMissDisable:chv */ > + I915_WRITE(GEN7_FF_THREAD_MODE, > + I915_READ(GEN7_FF_THREAD_MODE) & > + ~(GEN8_FF_DS_REF_CNT_FFME | GEN7_FF_VS_REF_CNT_FFME)); > + > + I915_WRITE(_3D_CHICKEN3, > + _3D_CHICKEN_SDE_LIMIT_FIFO_POLY_DEPTH(2)); > + > + /* WaSwitchSolVfFArbitrationPriority:bdw */ > + I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL); > + } > + > + /* Restore head */ > + > + I915_WRITE_HEAD(ring, ring->saved_head); > + I915_WRITE_HEAD_CTX(ring, ctx, ring->saved_head); > + > + return 0; > +} > + > static int gen8_init_rcs_context(struct drm_i915_gem_request *req) > { > int ret; > @@ -2021,6 +2418,10 @@ static int logical_render_ring_init(struct drm_device *dev) > ring->irq_get = gen8_logical_ring_get_irq; > ring->irq_put = gen8_logical_ring_put_irq; > ring->emit_bb_start = gen8_emit_bb_start; > + ring->enable = gen8_ring_enable; > + ring->disable = gen8_ring_disable; > + ring->save = gen8_ring_save; > + ring->restore = gen8_ring_restore; > > ring->dev = dev; > > @@ -2073,6 +2474,10 @@ static int logical_bsd_ring_init(struct drm_device *dev) > ring->irq_get = gen8_logical_ring_get_irq; > ring->irq_put = gen8_logical_ring_put_irq; > ring->emit_bb_start = gen8_emit_bb_start; > + ring->enable = gen8_ring_enable; > + ring->disable = gen8_ring_disable; > + ring->save = gen8_ring_save; > + ring->restore = gen8_ring_restore; > > return logical_ring_init(dev, ring); > } > @@ -2098,6 +2503,10 @@ static int logical_bsd2_ring_init(struct drm_device *dev) > ring->irq_get = gen8_logical_ring_get_irq; > ring->irq_put = gen8_logical_ring_put_irq; > ring->emit_bb_start = gen8_emit_bb_start; > + ring->enable = gen8_ring_enable; > + ring->disable = gen8_ring_disable; > + ring->save = gen8_ring_save; > + ring->restore = gen8_ring_restore; > > return logical_ring_init(dev, ring); > } > @@ -2128,6 +2537,10 @@ static int logical_blt_ring_init(struct drm_device *dev) > ring->irq_get = gen8_logical_ring_get_irq; > ring->irq_put = gen8_logical_ring_put_irq; > ring->emit_bb_start = gen8_emit_bb_start; > + ring->enable = gen8_ring_enable; > + ring->disable = gen8_ring_disable; > + ring->save = gen8_ring_save; > + ring->restore = gen8_ring_restore; > > return logical_ring_init(dev, ring); > } > @@ -2158,6 +2571,10 @@ static int logical_vebox_ring_init(struct drm_device *dev) > ring->irq_get = gen8_logical_ring_get_irq; > ring->irq_put = gen8_logical_ring_put_irq; > ring->emit_bb_start = gen8_emit_bb_start; > + ring->enable = gen8_ring_enable; > + ring->disable = gen8_ring_disable; > + ring->save = gen8_ring_save; > + ring->restore = gen8_ring_restore; > > return logical_ring_init(dev, ring); > } > @@ -2587,3 +3004,127 @@ void intel_lr_context_reset(struct drm_device *dev, > ringbuf->tail = 0; > } > } > + > +/** > + * intel_execlists_TDR_get_current_request() - return request currently > + * processed by engine > + * > + * @ring: Engine currently running context to be returned. > + * > + * @req: Output parameter containing the current request (the request at the > + * head of execlist queue corresponding to the given ring). May be NULL > + * if no request has been submitted to the execlist queue of this > + * engine. If the req parameter passed in to the function is not NULL > + * and a request is found and returned the request is referenced before > + * it is returned. It is the responsibility of the caller to dereference > + * it at the end of its life cycle. 
> + * > + * Return: > + * CONTEXT_SUBMISSION_STATUS_OK if request is found to be submitted and its > + * context is currently running on engine. > + * > + * CONTEXT_SUBMISSION_STATUS_INCONSISTENT if request is found to be submitted > + * but its context is not in a state that is consistent with current > + * hardware state for the given engine. This has been observed in three cases: > + * > + * 1. Before the engine has switched to this context after it has > + * been submitted to the execlist queue. > + * > + * 2. After the engine has switched away from this context but > + * before the context has been removed from the execlist queue. > + * > + * 3. The driver has lost an interrupt. Typically the hardware has > + * gone to idle but the driver still thinks the context belonging to > + * the request at the head of the queue is still executing. > + * > + * CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED if no context has been found > + * to be submitted to the execlist queue and if the hardware is idle. > + */ > +enum context_submission_status > +intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring, > + struct drm_i915_gem_request **req) > +{ > + struct drm_i915_private *dev_priv; > + unsigned long flags; > + struct drm_i915_gem_request *tmpreq = NULL; > + struct intel_context *tmpctx = NULL; > + unsigned hw_context = 0; > + unsigned sw_context = 0; > + bool hw_active = false; > + enum context_submission_status status = > + CONTEXT_SUBMISSION_STATUS_UNDEFINED; > + > + if (WARN_ON(!ring)) > + return status; > + > + dev_priv = ring->dev->dev_private; > + > + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL); > + spin_lock_irqsave(&ring->execlist_lock, flags); > + hw_context = I915_READ(RING_EXECLIST_STATUS_CTX_ID(ring)); > + > + hw_active = (I915_READ(RING_EXECLIST_STATUS_LO(ring)) & > + EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS) ? true : false; > + > + tmpreq = list_first_entry_or_null(&ring->execlist_queue, > + struct drm_i915_gem_request, execlist_link); > + > + if (tmpreq) { > + sw_context = intel_execlists_ctx_id((tmpreq->ctx)->engine[ring->id].state); > + > + /* > + * Only acknowledge the request in the execlist queue if it's > + * actually been submitted to hardware, otherwise there's the > + * risk of a false inconsistency detection between the > + * (unsubmitted) request and the idle hardware state. > + */ > + if (tmpreq->elsp_submitted > 0) { > + /* > + * If the caller has not passed a non-NULL req > + * parameter then it is not interested in getting a > + * request reference back. Don't temporarily grab a > + * reference since holding the execlist lock is enough > + * to ensure that the execlist code will hold its > + * reference all throughout this function. As long as > + * that reference is kept there is no need for us to > + * take yet another reference. The reason why this is > + * of interest is because certain callers, such as the > + * TDR hang checker, cannot grab struct_mutex before > + * calling and because of that we cannot dereference > + * any requests (DRM might assert if we do). Just rely > + * on the execlist code to provide indirect protection. > + */ > + if (req) > + i915_gem_request_reference(tmpreq); > + > + if (tmpreq->ctx) > + tmpctx = tmpreq->ctx; > + } > + } > + > + if (tmpctx) { > + status = ((hw_context == sw_context) && hw_active) ? 
> + CONTEXT_SUBMISSION_STATUS_OK : > + CONTEXT_SUBMISSION_STATUS_INCONSISTENT; > + } else { > + /* > + * If we don't have any queue entries and the > + * EXECLIST_STATUS register points to zero we are > + * clearly not processing any context right now > + */ > + WARN((hw_context || hw_active), "hw_context=%x, hardware %s!\n", > + hw_context, hw_active ? "not idle":"idle"); > + > + status = (hw_context || hw_active) ? > + CONTEXT_SUBMISSION_STATUS_INCONSISTENT : > + CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED; > + } > + > + if (req) > + *req = tmpreq; > + > + spin_unlock_irqrestore(&ring->execlist_lock, flags); > + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); > + > + return status; > +} > diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h > index de41ad6..d9acb31 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.h > +++ b/drivers/gpu/drm/i915/intel_lrc.h > @@ -29,7 +29,9 @@ > /* Execlists regs */ > #define RING_ELSP(ring) _MMIO((ring)->mmio_base + 0x230) > #define RING_EXECLIST_STATUS_LO(ring) _MMIO((ring)->mmio_base + 0x234) > +#define EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS (0x3 << 14) > #define RING_EXECLIST_STATUS_HI(ring) _MMIO((ring)->mmio_base + 0x234 + 4) > +#define RING_EXECLIST_STATUS_CTX_ID(ring) RING_EXECLIST_STATUS_HI(ring) > #define RING_CONTEXT_CONTROL(ring) _MMIO((ring)->mmio_base + 0x244) > #define CTX_CTRL_INHIBIT_SYN_CTX_SWITCH (1 << 3) > #define CTX_CTRL_ENGINE_CTX_RESTORE_INHIBIT (1 << 0) > @@ -118,4 +120,16 @@ u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj); > void intel_lrc_irq_handler(struct intel_engine_cs *ring); > void intel_execlists_retire_requests(struct intel_engine_cs *ring); > > +int intel_execlists_read_tail(struct intel_engine_cs *ring, > + struct intel_context *ctx, > + u32 *tail); > + > +int intel_execlists_write_head(struct intel_engine_cs *ring, > + struct intel_context *ctx, > + u32 head); > + > +int intel_execlists_read_head(struct intel_engine_cs *ring, > + struct intel_context *ctx, > + u32 *head); > + > #endif /* _INTEL_LRC_H_ */ > diff --git a/drivers/gpu/drm/i915/intel_lrc_tdr.h b/drivers/gpu/drm/i915/intel_lrc_tdr.h > new file mode 100644 > index 0000000..4520753 > --- /dev/null > +++ b/drivers/gpu/drm/i915/intel_lrc_tdr.h > @@ -0,0 +1,36 @@ > +/* > + * Copyright © 2015 Intel Corporation > + * > + * Permission is hereby granted, free of charge, to any person obtaining a > + * copy of this software and associated documentation files (the "Software"), > + * to deal in the Software without restriction, including without limitation > + * the rights to use, copy, modify, merge, publish, distribute, sublicense, > + * and/or sell copies of the Software, and to permit persons to whom the > + * Software is furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice (including the next > + * paragraph) shall be included in all copies or substantial portions of the > + * Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER > + * DEALINGS IN THE SOFTWARE. 
> + */
> +
> +#ifndef _INTEL_LRC_TDR_H_
> +#define _INTEL_LRC_TDR_H_
> +
> +/* Privileged execlist API used exclusively by TDR */
> +
> +void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring);
> +
> +enum context_submission_status
> +intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request **req);
> +
> +#endif /* _INTEL_LRC_TDR_H_ */
> +
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 4060acf..def0dcf 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -434,6 +434,88 @@ static void ring_write_tail(struct intel_engine_cs *ring,
>  	I915_WRITE_TAIL(ring, value);
>  }
>
> +int intel_ring_disable(struct intel_engine_cs *ring)
> +{
> +	WARN_ON(!ring);
> +
> +	if (ring && ring->disable)
> +		return ring->disable(ring);
> +	else {
> +		DRM_ERROR("Ring disable not supported on %s\n", ring->name);
> +		return -EINVAL;
> +	}
> +}
> +
> +int intel_ring_enable(struct intel_engine_cs *ring)
> +{
> +	WARN_ON(!ring);
> +
> +	if (ring && ring->enable)
> +		return ring->enable(ring);
> +	else {
> +		DRM_ERROR("Ring enable not supported on %s\n", ring->name);
> +		return -EINVAL;
> +	}
> +}
> +
> +int intel_ring_save(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req,
> +		bool force_advance)
> +{
> +	WARN_ON(!ring);
> +
> +	if (ring && ring->save)
> +		return ring->save(ring, req, force_advance);
> +	else {
> +		DRM_ERROR("Ring save not supported on %s\n", ring->name);
> +		return -EINVAL;
> +	}
> +}
> +
> +int intel_ring_restore(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req)
> +{
> +	WARN_ON(!ring);
> +
> +	if (ring && ring->restore)
> +		return ring->restore(ring, req);
> +	else {
> +		DRM_ERROR("Ring restore not supported on %s\n", ring->name);
> +		return -EINVAL;
> +	}
> +}
> +
> +void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req)
> +{
> +	struct intel_ringbuffer *ringbuf;
> +	struct drm_i915_private *dev_priv;
> +
> +	if (WARN_ON(!ring))
> +		return;
> +
> +	dev_priv = ring->dev->dev_private;
> +
> +	if (i915.enable_execlists) {
> +		struct intel_context *ctx;
> +
> +		if (WARN_ON(!req))
> +			return;
> +
> +		ctx = req->ctx;
> +		ringbuf = ctx->engine[ring->id].ringbuf;
> +
> +		/*
> +		 * In gen8+ context head is restored during reset and
> +		 * we can use it as a reference to set up the new
> +		 * driver state.
> +		 */
> +		I915_READ_HEAD_CTX(ring, ctx, ringbuf->head);
> +		ringbuf->last_retired_head = -1;
> +		intel_ring_update_space(ringbuf);
> +	}
> +}
> +
>  u64 intel_ring_get_active_head(struct intel_engine_cs *ring)
>  {
>  	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> @@ -629,7 +711,7 @@ static int init_ring_common(struct intel_engine_cs *ring)
>  	ringbuf->tail = I915_READ_TAIL(ring) & TAIL_ADDR;
>  	intel_ring_update_space(ringbuf);
>
> -	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
> +	i915_hangcheck_reinit(ring);
>
>  out:
>  	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 7349d92..7014778 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -49,6 +49,22 @@ struct intel_hw_status_page {
>  #define I915_READ_MODE(ring) I915_READ(RING_MI_MODE((ring)->mmio_base))
>  #define I915_WRITE_MODE(ring, val) I915_WRITE(RING_MI_MODE((ring)->mmio_base), val)
>
> +
> +#define I915_READ_TAIL_CTX(engine, ctx, outval) \
> +	intel_execlists_read_tail((engine), \
> +			(ctx), \
> +			&(outval));
> +
> +#define I915_READ_HEAD_CTX(engine, ctx, outval) \
> +	intel_execlists_read_head((engine), \
> +			(ctx), \
> +			&(outval));
> +
> +#define I915_WRITE_HEAD_CTX(engine, ctx, val) \
> +	intel_execlists_write_head((engine), \
> +			(ctx), \
> +			(val));
> +

Don't see the benefit of all the macros. If you look at lrc_reg_state we can
throw most if not all of this register reading/writing code out.

>  /* seqno size is actually only a uint32, but since we plan to use MI_FLUSH_DW to
>   * do the writes, and that must have qw aligned offsets, simply pretend it's 8b.
>   */
> @@ -94,6 +110,34 @@ struct intel_ring_hangcheck {
>  	enum intel_ring_hangcheck_action action;
>  	int deadlock;
>  	u32 instdone[I915_NUM_INSTDONE_REG];
> +
> +	/*
> +	 * Last recorded ring head index.
> +	 * This is only ever a ring index where as active
> +	 * head may be a graphics address in a ring buffer
> +	 */
> +	u32 last_head;
> +
> +	/* Flag to indicate if engine reset required */
> +	atomic_t flags;
> +
> +	/* Indicates request to reset this engine */
> +#define I915_ENGINE_RESET_IN_PROGRESS (1<<0)
> +
> +	/*
> +	 * Timestamp (seconds) from when the last time
> +	 * this engine was reset.
> +	 */
> +	u32 last_engine_reset_time;
> +
> +	/*
> +	 * Number of times this engine has been
> +	 * reset since boot
> +	 */
> +	u32 reset_count;
> +
> +	/* Number of TDR hang detections */
> +	u32 tdr_count;
>  };
>
>  struct intel_ringbuffer {
> @@ -205,6 +249,14 @@ struct intel_engine_cs {
>  #define I915_DISPATCH_RS 0x4
>  	void (*cleanup)(struct intel_engine_cs *ring);
>
> +	int (*enable)(struct intel_engine_cs *ring);
> +	int (*disable)(struct intel_engine_cs *ring);
> +	int (*save)(struct intel_engine_cs *ring,
> +		    struct drm_i915_gem_request *req,
> +		    bool force_advance);
> +	int (*restore)(struct intel_engine_cs *ring,
> +		       struct drm_i915_gem_request *req);
> +
>  	/* GEN8 signal/wait table - never trust comments!
>  	 *	  signal to	signal to    signal to	 signal to	signal to
>  	 *	    RCS		   VCS		BCS	   VECS		 VCS2
> @@ -311,6 +363,9 @@ struct intel_engine_cs {
>
>  	struct intel_ring_hangcheck hangcheck;
>
> +	/* Saved head value to be restored after reset */
> +	u32 saved_head;
> +
>  	struct {
>  		struct drm_i915_gem_object *obj;
>  		u32 gtt_offset;
> @@ -463,6 +518,15 @@ void intel_ring_update_space(struct intel_ringbuffer *ringbuf);
>  int intel_ring_space(struct intel_ringbuffer *ringbuf);
>  bool intel_ring_stopped(struct intel_engine_cs *ring);
>
> +void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req);
> +int intel_ring_disable(struct intel_engine_cs *ring);
> +int intel_ring_enable(struct intel_engine_cs *ring);
> +int intel_ring_save(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req, bool force_advance);
> +int intel_ring_restore(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req);
> +
>  int __must_check intel_ring_idle(struct intel_engine_cs *ring);
>  void intel_ring_init_seqno(struct intel_engine_cs *ring, u32 seqno);
>  int intel_ring_flush_all_caches(struct drm_i915_gem_request *req);
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 2df4246..f20548c 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1623,6 +1623,153 @@ bool intel_has_gpu_reset(struct drm_device *dev)
>  	return intel_get_gpu_reset(dev) != NULL;
>  }
>
> +static inline int wait_for_engine_reset(struct drm_i915_private *dev_priv,
> +		unsigned int grdom)
> +{

No need to inline.

> +#define _CND ((__raw_i915_read32(dev_priv, GEN6_GDRST) & grdom) == 0)
> +
> +	/*
> +	 * Spin waiting for the device to ack the reset request.
> +	 * Times out after 500 us
> +	 * */
> +	return wait_for_atomic_us(_CND, 500);
> +
> +#undef _CND
> +}
> +
> +static int do_engine_reset_nolock(struct intel_engine_cs *engine)
> +{
> +	int ret = -ENODEV;
> +	struct drm_i915_private *dev_priv = engine->dev->dev_private;
> +
> +	assert_spin_locked(&dev_priv->uncore.lock);
> +
> +	switch (engine->id) {
> +	case RCS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_RENDER);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_RENDER);
> +		break;
> +
> +	case BCS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_BLT);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_BLT);
> +		break;
> +
> +	case VCS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_MEDIA);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_MEDIA);
> +		break;
> +
> +	case VECS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_VECS);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_VECS);
> +		break;
> +
> +	case VCS2:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN8_GRDOM_MEDIA2);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN8_GRDOM_MEDIA2);
> +		break;
> +
> +	default:
> +		DRM_ERROR("Unexpected engine: %d\n", engine->id);
> +		break;
> +	}

	int mask[NUM_RINGS] = { GEN6_GRDOM_RENDER, GEN6_GRDOM_BLT...};

	if (WARN_ON_ONCE(!engine->initialized))
		return;

	__raw_i915_write(dev_priv, mask[engine->id]);
	engine->hangcheck.reset_count++;
	ret = wait_for_engine_reset(dev_priv, mask[engine->id]);

> +
> +	return ret;
> +}
> +
> +static int gen8_do_engine_reset(struct intel_engine_cs *engine)
> +{
> +	struct drm_device *dev = engine->dev;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	int ret = -ENODEV;
> +	unsigned long irqflags;
> +
> +	spin_lock_irqsave(&dev_priv->uncore.lock, irqflags);
> +	ret = do_engine_reset_nolock(engine);
> +	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
> +
> +	if (!ret) {
> +		u32 reset_ctl = 0;
> +
> +		/*
> +		 * Confirm that reset control register back to normal
> +		 * following the reset.
> +		 */
> +		reset_ctl = I915_READ(RING_RESET_CTL(engine->mmio_base));
> +		WARN(reset_ctl & 0x3, "Reset control still active after reset! (0x%08x)\n",
> +			reset_ctl);
> +	} else {
> +		DRM_ERROR("Engine reset failed! (%d)\n", ret);
> +	}
> +
> +	return ret;
> +}
> +
> +int intel_gpu_engine_reset(struct intel_engine_cs *engine)
> +{
> +	/* Reset an individual engine */
> +	int ret = -ENODEV;
> +	struct drm_device *dev = engine->dev;
> +
> +	switch (INTEL_INFO(dev)->gen) {

You can pass dev_priv to INTEL_INFO also, and prefer to do so here and in
the rest of the code.

> +	case 8:

case 9: ?

Thanks,
-Mika

> +		ret = gen8_do_engine_reset(engine);
> +		break;
> +	default:
> +		DRM_ERROR("Per Engine Reset not supported on Gen%d\n",
> +				INTEL_INFO(dev)->gen);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * On gen8+ a reset request has to be issued via the reset control register
> + * before a GPU engine can be reset in order to stop the command streamer
> + * and idle the engine. This replaces the legacy way of stopping an engine
> + * by writing to the stop ring bit in the MI_MODE register.
> + */
> +int intel_request_gpu_engine_reset(struct intel_engine_cs *engine)
> +{
> +	/* Request reset for an individual engine */
> +	int ret = -ENODEV;
> +	struct drm_device *dev = engine->dev;
> +
> +	if (INTEL_INFO(dev)->gen >= 8)
> +		ret = gen8_request_engine_reset(engine);
> +	else
> +		DRM_ERROR("Reset request not supported on Gen%d\n",
> +				INTEL_INFO(dev)->gen);
> +
> +	return ret;
> +}
> +
> +/*
> + * It is possible to back off from a previously issued reset request by simply
> + * clearing the reset request bit in the reset control register.
> + */
> +int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine)
> +{
> +	/* Roll back reset request for an individual engine */
> +	int ret = -ENODEV;
> +	struct drm_device *dev = engine->dev;
> +
> +	if (INTEL_INFO(dev)->gen >= 8)
> +		ret = gen8_unrequest_engine_reset(engine);
> +	else
> +		DRM_ERROR("Reset unrequest not supported on Gen%d\n",
> +				INTEL_INFO(dev)->gen);
> +
> +	return ret;
> +}
> +
>  bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
>  {
>  	return check_for_unclaimed_mmio(dev_priv);
> --
> 1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
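As a usage illustration for the submission-status query quoted above, here is a
minimal, hypothetical caller sketch. Only intel_execlists_TDR_get_current_request
and the context_submission_status values come from the patch; the helper name
and the decision it encodes are assumptions. Passing NULL for the request
pointer follows the documented contract for callers (such as a hang checker
that cannot take struct_mutex) that only want the status and no request
reference.

/*
 * Hypothetical caller (not part of the patch): check whether driver and
 * hardware agree on the currently executing context before committing
 * to a per-engine reset.
 */
static bool engine_safe_to_reset(struct intel_engine_cs *ring)
{
	enum context_submission_status status;

	/* NULL req: we only want the status, no request reference is taken */
	status = intel_execlists_TDR_get_current_request(ring, NULL);

	return status == CONTEXT_SUBMISSION_STATUS_OK;
}

Anything other than CONTEXT_SUBMISSION_STATUS_OK is then left to the caller to
handle.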
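To make the lrc_reg_state remark about the CTX macros a bit more concrete, here
is a rough sketch of the kind of accessor that could replace I915_READ_HEAD_CTX,
assuming the cached register-state page is valid while the context is pinned.
CTX_RING_HEAD is the index of the register offset/value pair in the context
image (it currently lives in intel_lrc.c, so it would need exporting), and the
function name is made up for the example.

/*
 * Sketch only: read the saved ring head straight out of the cached
 * context image instead of going through a read macro/helper.
 */
static u32 lrc_get_ring_head(struct intel_engine_cs *ring,
			     struct intel_context *ctx)
{
	u32 *reg_state = ctx->engine[ring->id].lrc_reg_state;

	/* reg_state[CTX_RING_HEAD] is the register offset; +1 is the value */
	return reg_state[CTX_RING_HEAD + 1];
}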
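Mika's table suggestion for do_engine_reset_nolock could be filled out roughly
as below. This is only a sketch of that idea: the engine_reset_domains name,
the designated initializers and the bounds/zero checks are assumptions, and the
array size would need to use whatever constant the driver has for the number of
engines (I915_NUM_RINGS here).

/* Sketch of the table-driven variant suggested in the review. */
static const u32 engine_reset_domains[I915_NUM_RINGS] = {
	[RCS]  = GEN6_GRDOM_RENDER,
	[VCS]  = GEN6_GRDOM_MEDIA,
	[BCS]  = GEN6_GRDOM_BLT,
	[VECS] = GEN6_GRDOM_VECS,
	[VCS2] = GEN8_GRDOM_MEDIA2,
};

static int do_engine_reset_nolock(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->dev->dev_private;
	u32 grdom;

	assert_spin_locked(&dev_priv->uncore.lock);

	if (WARN_ON_ONCE(engine->id >= I915_NUM_RINGS))
		return -ENODEV;

	grdom = engine_reset_domains[engine->id];
	if (WARN_ON_ONCE(grdom == 0))
		return -ENODEV;

	/* Same sequence as each switch case above, just table-driven */
	__raw_i915_write32(dev_priv, GEN6_GDRST, grdom);
	engine->hangcheck.reset_count++;

	return wait_for_engine_reset(dev_priv, grdom);
}

Compared with the switch, the per-engine differences live in one table and the
reset_count increment cannot be forgotten when a new engine is added.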
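Finally, since the ring-level hooks (intel_ring_disable/save/restore/enable,
intel_gpu_engine_reset_resample) and the reset-request helpers only appear in
this hunk as exports, a rough sketch of how a recovery path might chain them is
included for orientation. Everything here is an assumption for illustration —
the function name, the ordering, the error handling, where the back-off
applies, and the request cleanup — and it is not the patch's actual recovery
code.

/*
 * Hypothetical recovery flow for a single hung engine, built only from
 * the helpers exported by this patch. Ordering and error handling are
 * illustrative assumptions.
 */
static int sketch_per_engine_recovery(struct intel_engine_cs *ring)
{
	struct drm_i915_gem_request *req = NULL;
	enum context_submission_status status;
	int ret;

	status = intel_execlists_TDR_get_current_request(ring, &req);
	if (status != CONTEXT_SUBMISSION_STATUS_OK) {
		/* Driver and hardware disagree; leave this to the caller */
		ret = -EAGAIN;
		goto out;
	}

	/* Stop/idle the engine (presumably the reset handshake on gen8) */
	ret = intel_ring_disable(ring);
	if (ret)
		goto out;

	/* Save engine state; force_advance asks for head to be moved on */
	ret = intel_ring_save(ring, req, true);
	if (ret)
		goto backoff;

	ret = intel_gpu_engine_reset(ring);
	if (ret)
		goto backoff;

	/* Resync the driver-side ring state from the restored context head */
	intel_gpu_engine_reset_resample(ring, req);

	ret = intel_ring_restore(ring, req);
	if (ret)
		goto out;

	/* Re-enable; presumably the privileged TDR context resubmission */
	ret = intel_ring_enable(ring);
	goto out;

backoff:
	/* The reset did not complete: withdraw the pending reset request */
	intel_unrequest_gpu_engine_reset(ring);
out:
	/* Drop the reference taken by the status query above */
	if (req)
		i915_gem_request_unreference__unlocked(req);
	return ret;
}

In particular the unreference helper and the locking assumptions around it
would need checking against the actual TDR caller in the rest of the series.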