On 18/06/2015 11:36, Chris Wilson wrote:
> On Thu, Jun 18, 2015 at 11:11:55AM +0100, Tomas Elf wrote:
>> On 18/06/2015 10:51, Mika Kuoppala wrote:
>>> In order for gen8+ hardware to guarantee that no context switch
>>> takes place during engine reset and that current context is properly
>>> saved, the driver needs to notify and query hw before commencing
>>> with reset.
>>>
>>> There are gpu hangs where the engine gets so stuck that it will never
>>> report being ready for reset. We could proceed with the reset anyway,
>>> but with some hangs on skl the forced gpu reset results in a system
>>> hang. By inspection, the unreadiness for reset seems to correlate
>>> with the probable system hang.
>>>
>>> We will only proceed with reset if all engines report that they
>>> are ready for reset. If root cause for system hang is found and
>>> can be worked around with another means, we can reconsider if
>>> we can reinstate full reset for unreadiness case.
>>>
>>> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
>>> v3: updated commit msg
>>> v4: timeout_ms, simpler error path (Chris)
>>>
>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
>>> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
>>> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
>>> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
>>> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
>>> Cc: Tomas Elf <tomas.elf@xxxxxxxxx>
>>> Signed-off-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
>>> ---
>>> drivers/gpu/drm/i915/i915_reg.h | 3 +++
>>> drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
>>> 2 files changed, 45 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>>> index 0b979ad..3684f92 100644
>>> --- a/drivers/gpu/drm/i915/i915_reg.h
>>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>>> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>>> #define RING_MAX_IDLE(base) ((base)+0x54)
>>> #define RING_HWS_PGA(base) ((base)+0x80)
>>> #define RING_HWS_PGA_GEN6(base) ((base)+0x2080)
>>> +#define RING_RESET_CTL(base) ((base)+0xd0)
>>> +#define RESET_CTL_REQUEST_RESET (1 << 0)
>>> +#define RESET_CTL_READY_TO_RESET (1 << 1)
>>>
>>> #define HSW_GTT_CACHE_EN 0x4024
>>> #define GTT_CACHE_EN_ALL 0xF0007FFF
>>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>>> index 4a86cf0..160a47a 100644
>>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>>> @@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
>>> return ret;
>>> }
>>>
>>> +static int wait_for_register(struct drm_i915_private *dev_priv,
>>> + const u32 reg,
>>> + const u32 mask,
>>> + const u32 value,
>>> + const unsigned long timeout_ms)
>>> +{
>>> + return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
>>> +}
>>> +
>>> +static int gen8_do_reset(struct drm_device *dev)
>>> +{
>>> + struct drm_i915_private *dev_priv = dev->dev_private;
>>> + struct intel_engine_cs *engine;
>>> + int i;
>>> +
>>> + for_each_ring(engine, dev_priv, i) {
>>> + I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>>> + _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>>> +
>>> + if (wait_for_register(dev_priv,
>>> + RING_RESET_CTL(engine->mmio_base),
>>> + RESET_CTL_READY_TO_RESET,
>>> + RESET_CTL_READY_TO_RESET,
>>> + 700)) {
>>> + DRM_ERROR("%s: reset request timeout\n", engine->name);
>>> + goto not_ready;
>>> + }
>>
>> So just to be clear here: if one or more of the reset control
>> registers decide that they are at a point where they will never
>> again be ready for reset, we will simply not do a full GPU reset
>> until reboot? Is there perhaps a case where you would want to try
>> the reset request once or twice or like five times or whatever but
>> then simply go ahead with the full GPU reset regardless of what the
>> reset control register tells you? After all, it's our only way out
>> if the hardware is truly stuck.
>
> What happens is that we skip the reset, report an error and that marks
> the GPU as wedged. To get out of that state requires user intervention,
> either by rebooting or through use of debugfs/i915_wedged.

That's a fair point: we will mark the GPU as terminally wedged, and that
has always been there as the final state where we simply give up. I
guess it might be better to actively mark the GPU as terminally wedged
from the driver's point of view rather than plow ahead in a last-ditch
effort to reset the GPU, which may or may not succeed and which may
irrecoverably hang the system in the worst case. I guess we at least
protect the currently running context if we just mark the GPU as
terminally wedged instead of putting it in a potentially undefined
state.
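For clarity, a minimal sketch of the flow being described; the
gen8_request_reset_readiness() helper is made up for illustration and is
not the actual i915 entry point:

/*
 * Minimal sketch of the behaviour described above; the
 * gen8_request_reset_readiness() helper is invented for illustration
 * and is not the actual i915 code path.
 */
static int try_gpu_reset(struct drm_i915_private *dev_priv)
{
	/* Ask every engine whether it is safe to reset. If any engine
	 * never reports RESET_CTL_READY_TO_RESET, skip the reset... */
	if (gen8_request_reset_readiness(dev_priv))
		return -EIO;	/* ...and the caller marks the GPU wedged */

	/* All engines acknowledged the request: safe to do the reset. */
	return gen6_do_reset(dev_priv->dev);
}

Getting out of the wedged state then takes user intervention, as Chris
says: a reboot or a write to debugfs (something like
echo 1 > /sys/kernel/debug/dri/0/i915_wedged, depending on the card
number).
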
>
> We can try to repeat the reset from a workqueue, but we should first
> tackle interaction with TDR first and get your per-engine reset
> upstream, along with it's various levels of backoff and recovery.
> -Chris

My point was more along the lines of bailing out if the reset request
fails, not returning an error, and simply keeping track of the number
of times we've attempted the reset request. By not returning an error
we would allow subsequent hang detections to happen (since the hang is
still there), which would end up at the same reset request again in the
future. Each time the reset request failed we would increment the
counter, and at some point we would decide that we've had too many
unsuccessful reset request attempts and go ahead with the reset anyway.
If that reset then failed we would return an error at that point in
time, resulting in a terminally wedged state. But, yeah, I can see why
we shouldn't do this.
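Just to illustrate what I had in mind, a very rough sketch; everything
here except the gen6_do_reset() call (the reset_request_fails counter,
the gen8_request_engine_resets() helper and the attempt limit) is made
up for illustration:

/*
 * Hypothetical "retry the request a few times before forcing the reset"
 * flow; field and helper names are invented for illustration only.
 */
#define MAX_RESET_REQUEST_ATTEMPTS 5

static int gen8_do_reset_with_retries(struct drm_device *dev)
{
	struct drm_i915_private *dev_priv = dev->dev_private;

	if (gen8_request_engine_resets(dev_priv) == 0) {
		dev_priv->gpu_error.reset_request_fails = 0;
		return gen6_do_reset(dev);
	}

	/* Handshake failed: back off without reporting an error so that
	 * hang detection fires again and we retry the request later. */
	if (++dev_priv->gpu_error.reset_request_fails < MAX_RESET_REQUEST_ATTEMPTS)
		return 0;

	/* Too many failed requests: force the full reset anyway and let
	 * a failure here wedge the GPU. */
	return gen6_do_reset(dev);
}
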
We could certainly introduce per-engine reset support into this to add
more levels of recovery and fall-back, but in the end, if we use the
reset handshake for both per-engine reset and full GPU reset and the
handshake fails in both cases, then we're screwed no matter what: we
try the engine reset request and fail, then fall back to the full GPU
reset request and fail there too - terminally wedged. The reset request
failure blocks both per-engine reset and full GPU reset and results in
a terminally wedged state either way.
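In other words, something like this hypothetical escalation ladder,
where both helper names are invented for illustration and this is not
the actual TDR code:

/*
 * Hypothetical escalation ladder: per-engine reset first, full GPU
 * reset second, terminally wedged if both handshakes fail.
 */
static int recover_from_hang(struct drm_i915_private *dev_priv,
			     struct intel_engine_cs *engine)
{
	/* Level 1: request and perform a reset of the hung engine only. */
	if (request_and_reset_engine(dev_priv, engine) == 0)
		return 0;

	/* Level 2: fall back to requesting a full GPU reset. */
	if (request_and_reset_all_engines(dev_priv) == 0)
		return 0;

	/* Both handshakes failed: nothing left but the wedged state. */
	return -EIO;
}
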
The only thing we gain in this particular case by adding per-engine
reset support is when the reset request failure is limited to the
blitter engine (which Ben Widawsky seems to be questioning on IRC). In
that case, per-engine reset would allow us to recover the other engines
individually without touching the full GPU reset path, so we would
never have to issue a blitter engine reset request in the first place
and risk having it fail and block hang recovery for all engines.
Anyway, if we prefer the terminally wedged state rather than a
last-ditch attempt at a full GPU reset then I can understand how this
makes sense.
Thanks,
Tomas
>