Re: [RFC 00/13] TDR and Watchdog Reset

Daniel Vetter <daniel@xxxxxxxx> · Mon, 16 Dec 2013 22:12:09 +0100

On Mon, Dec 16, 2013 at 04:02:20PM +0000, Lister, Ian wrote:
> This patchset contains TDR and Watchdog reset against a 3.13
> drm-intel-nightly tree that is now about 2 weeks old.
> 
> I have re-worked the TDR and Watchdog Reset features to integrate
> them more closely with the existing TDR and scoring mechanism.
> 
> This is still a work-in-progress and I am currently debugging a couple
> of issues but I would like to get some early feedback.
> 
> Thanks,
> Ian.
> 
> From f71a7de85e9d81be3aa3962c8fe2557235ff21c1 Mon Sep 17 00:00:00 2001
> Message-Id: <cover.1387201899.git.ian.lister@xxxxxxxxx>
> From: ian-lister <ian.lister@xxxxxxxxx>
> Date: Mon, 16 Dec 2013 13:51:39 +0000
> Subject: [RFC 00/13] TDR and Watchdog Reset
> 
> This patchset adds support for per-engine timeout detection and recovery
> and adds batch specific watchdog reset.

Ok, I've read through the patches and dropped my high-level maintainer
review on them. Overall this looks much better than the first tdr rfc and
was a nice read. The few misplaced hunks (probably due to some rebase
mixup) didn't really confuse the overall picture.

For the detailed review I'll sign up Mika, he's been doing most of the
reset/hangcheck related changes recently in upstream. The really tricky
bits around the hangcheck code are the ordering constraints in combination
with lock taking/releasing. Chris, Ville and me have mostly reviewed those
parts, so I'll definitely double-check that before merging.

For merging the patches I always prefer an incremental approach once early
pieces are ready with review and testcases. Below a few more comments on
the features overall, mostly around testing.

> Per-ring TDR
> The detection logic has been modified to detect hangs on individual
> engines and pass this information through the to the recovery handler.
> Rather than a global reset it will attempt a per-engine reset. The
> registers associated with the ring are saved and restored so that
> when the ring restarts it continues from the next instruction in the
> ring. For example, if it was executing an MI_START_BATCH_BUFFER command
> it will advance to the next instruction which is likely to be the
> mailbox updates and user interrupt. This means that no extra effort
> is required to deal with synchronisation. From the perspective of the
> driver it looks like the batch buffer completed normally as all the
> normal signalling will take place, however the context stats will
> have been updated to flag up the guilty context.

hangcheck code has always been terribly tricky, so I want solid test
coverage for corner cases here. I think the following should do the trick:

- Testcase that exercise some of the starvation issues you fix in the
  first patches. That looks like a real bug, so I also prefer to merge
  those patches as soon as they're ready.

- Patches to exercise the per-ring reset stuff by using different
  contexts/file descriptors and checking that unrelated work does get
  completed correctly. The tricky bits here are pageflips (we already have
  a basic testcase for that) and semaphores. For semaphores we currently
  lack a real functional test, but Damien is signed up to write that once
  he's back from vacation. So probably useful to sync up with him. For
  pageflip tests you'll get style points if you use the CRC support to
  check for functional correctness of which buffer gets flipped to the
  front in case of hangs, but imo that's overkill.

The pageflip tests are all subtests of kms_flip with "hang" somewhere in
the name. The downside for Android is that this uses cairo, so atm it's
not enabled. But apparently some group in VPG has libcairo ported already,
so hopefully this will be sorted out soon.

I guess you already have some other validation tests around, porting these
to i-g-t (and then using a common codebase) would also be great (I don't
yet have source access, but should have it soon).

Given that we don't expose the hangcheck improvements to userspace
(besides the fact that fewer contexts will be victimized) there's no need
in that area for testcases.

> Watchdog Reset
> This is requested via flags to the batch buffer submission IOCTL.
> It is currently only supported for the render and video rings.
> The batch buffer command is surrounded by a hardware timer start
> command and stop command. If the batch completes before the timer
> expires then the timer is cancelled and no interrupt is generated
> so everything continues normally. However if the batch hangs then
> the timer will generate an interrupt and it will trigger an engine
> reset. This feature requires per-ring TDR to do the recovery work.

First a comment on the interface: I have honestly lost track of where
exactly the watchdog threshold gets set, but I think we should leave that
option to userspace.

For testcases we need the usual mix for new features:
- Exercising abi cornercases (especially rejecting evil/stupid data from
  userspace).
- Some functional check. With the rendercpy/mediafill infrastructure it
  should be fairly easy to construct RCS/VCS batches which take way too
  long.

The real big deal here is though that for this new abi extension I need an
open-source user. So libva needs to be ported to use this feature for both
it's vcs and rcs batches. Without an open-source user I can't merge this.

The actual merge order for abi extensions is also rather strict:
1. Post both kernel patches and UMD enabling patches for a given feature.
2. Once the entire slice (so both KMD and UMD parts) is reviewed and
tested the kernel parts get merged.
3. As soon as the kernel patches hit a stable git branch (i.e.
drm-intel-next) we can merge the libdrm patches.
4. After a libdrm release the actual UMD patches can go in.

A month ago we've had a little screw up with that and Dave Airlie (drm
upstream maintainer) stepped in an reverted a few things (resulting in
some temporary build breakage). So following the rules here is important.

Thanks, Daniel

PS: One thing I've forgotten in all this: Like with the from address in
the patches your name in the s-o-b lines should be "Ian Lister" instead of
"ian-lister".

> 
> ian-lister (13):
>   drm/i915: Periodic sampling for hang detection
>   drm/i915: Improved hang detection logic
>   drm/i915: Additional ring operations for TDR
>   drm/i915: Force wake restore for TDR
>   drm/i915: Per-engine recovery
>   drm/i915: Communicating reset requests
>   drm/i915: Additional debug for TDR
>   drm/i915: TDR loose ends
>   drm/i915: Watchdog timer support functions
>   drm/i915: MI_LOAD_REGISTER_IMM fix
>   drm/i915: Added watchdog interrupt handling
>   drm/i915: Enabled watchdog timer interrupts
>   drm/i915: Exec buffer inserts watchdog commands
> 
>  drivers/gpu/drm/i915/i915_debugfs.c        |  67 ++++
>  drivers/gpu/drm/i915/i915_dma.c            |   3 +
>  drivers/gpu/drm/i915/i915_drv.c            |  46 +++
>  drivers/gpu/drm/i915/i915_drv.h            |  41 ++-
>  drivers/gpu/drm/i915/i915_gem.c            |  26 +-
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  30 +-
>  drivers/gpu/drm/i915/i915_irq.c            | 531
> ++++++++++++++++++---------
>  drivers/gpu/drm/i915/i915_reg.h            |  21 ++
>  drivers/gpu/drm/i915/intel_display.c       |  30 +-
>  drivers/gpu/drm/i915/intel_ringbuffer.c    | 557
> +++++++++++++++++++++++++++--
>  drivers/gpu/drm/i915/intel_ringbuffer.h    |  53 +++
>  drivers/gpu/drm/i915/intel_uncore.c        | 373 ++++++++++++++++++-
>  include/drm/drmP.h                         |   7 +
>  13 files changed, 1575 insertions(+), 210 deletions(-)
> 
> -- 
> 1.8.5.1
> 
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx