Re: [PATCH 0/3] Per Engine hang detection and recovery

"Siluvery, Arun" <arun.siluvery@xxxxxxxxx> · Mon, 11 Nov 2013 15:49:01 +0000

On Mon, 2013-11-11 at 16:31 +0100, Daniel Vetter wrote:
> On Mon, Nov 11, 2013 at 02:58:31PM +0000, Siluvery, Arun wrote:
> > From: "Siluvery, Arun" <arun.siluvery@xxxxxxxxx>
> > 
> > This patchset contains changes for Timeout Detection and Recovery (TDR), which
> > provides per-engine hang detection and recovery.
> > The current driver performs a full GPU reset in case of a hang; TDR attempts to
> > reset only the engine that is hung and falls back to a full reset if that fails.
> > 
> > A full GPU reset can leave the system in a state where the display updates
> > intermittently and may lock up, depending on the workload at the time of the
> > hang. TDR can help recover the system in those cases, thus increasing stability.
> 
> Are these hw lockups you've seen with full gpu reset or just kernel
> deadlocks? If it's the latter we've recently (re-)fixed a bunch of those,
> and if there are new ones we definitely want to fix them and add testcases
> to igt. So if you could share some of these hangs and their
> analysis/testcases that'd be very interesting.
> 
> That's of course on top of any other reset improvements.

I think these are kernel lockups. Unfortunately, when this happens there
is no response from the kernel, and sending a break does not help either.
I will try to get more details on this.
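
For reference, the recovery flow described in the cover letter (try a
per-engine reset first, fall back to a full GPU reset if that fails) can be
sketched roughly as below. This is a self-contained toy model, not code from
the patches: struct gpu_model, reset_engine(), reset_gpu() and
recover_from_hang() are all hypothetical names, and the "reset" only flips
flags in memory instead of touching hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical engine list; the real driver defines its own ring ids. */
enum engine_id { ENGINE_RENDER, ENGINE_BLITTER, ENGINE_VIDEO, NUM_ENGINES };

enum recovery_result { RECOVERED_BY_ENGINE_RESET, RECOVERED_BY_FULL_RESET };

/* Toy model of the GPU state; every field here is an assumption. */
struct gpu_model {
	bool hung[NUM_ENGINES];                  /* is this engine hung? */
	bool engine_reset_fails[NUM_ENGINES];    /* simulate a failing engine reset */
	unsigned int engine_resets[NUM_ENGINES]; /* successful per-engine resets */
	unsigned int full_resets;                /* full GPU resets performed */
};

/* Try to reset a single engine; returns true on success. */
bool reset_engine(struct gpu_model *gpu, enum engine_id id)
{
	assert((int)id >= 0 && (int)id < NUM_ENGINES);
	if (gpu->engine_reset_fails[id])
		return false;
	gpu->hung[id] = false;
	gpu->engine_resets[id]++;
	return true;
}

/* Full reset: clears every engine, whether it was hung or not. */
void reset_gpu(struct gpu_model *gpu)
{
	for (int i = 0; i < NUM_ENGINES; i++)
		gpu->hung[i] = false;
	gpu->full_resets++;
}

/* Prefer the narrow reset; fall back to the full reset only on failure. */
enum recovery_result recover_from_hang(struct gpu_model *gpu, enum engine_id id)
{
	if (reset_engine(gpu, id))
		return RECOVERED_BY_ENGINE_RESET;
	reset_gpu(gpu);
	return RECOVERED_BY_FULL_RESET;
}
```

In this model a successful per-engine reset leaves the other engines
untouched, which is the property that avoids disturbing unrelated work; the
counters only stand in for the kind of statistics a debugfs file might export.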

> 
> > The changes are split in multiple patches.
> > 1. Ring utility functions to save/restore context, reset ring etc
> > 2. TDR hang detection logic and error recovery function
> > 3. Debugfs changes to export TDR statistics.
> > 
> > I have tested these changes on drm-intel-nightly with a simple test which
> > inserts a bad batch buffer on a specific engine to trigger a hang. The TDR
> > logic then detects this and recovers from it by skipping the bad batch.
> 
> I want this testcase (as a patch to igt).

OK, I will send it to the mailing list.

> 
> > Please review and give your comments.
> 
> I'll try to have a look later this week, atm still busy with bdw
> upstreaming. One more meta-comment though: Something with your git setup
> seems to be broken, the patches don't have in-reply-to headers pointing at
> this cover letter and hence the threading is a bit broken.

OK, thanks.
Yes, my mistake: I missed an option while generating the patches.
Do you suggest resending all the patches?

> 
> Cheers, Daniel
> > 
> > regards
> > Arun
> > 
> > Siluvery, Arun (3):
> >   drm/i915: Add ring functions to save/restore context for per-ring
> >     reset
> >   drm/i915: Per-engine Timeout detection and recovery on HSW
> >   drm/i915: Export TDR hang count to debugfs
> > 
> >  drivers/gpu/drm/i915/i915_debugfs.c     |  68 +++-
> >  drivers/gpu/drm/i915/i915_dma.c         |  16 +-
> >  drivers/gpu/drm/i915/i915_drv.c         | 195 +++++++++-
> >  drivers/gpu/drm/i915/i915_drv.h         |  92 ++++-
> >  drivers/gpu/drm/i915/i915_gem.c         |  77 +++-
> >  drivers/gpu/drm/i915/i915_gpu_error.c   |  25 +-
> >  drivers/gpu/drm/i915/i915_irq.c         | 556 ++++++++++++++++-------------
> >  drivers/gpu/drm/i915/i915_reg.h         |   7 +
> >  drivers/gpu/drm/i915/intel_display.c    |  25 +-
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 607 +++++++++++++++++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  51 +++
> >  drivers/gpu/drm/i915/intel_uncore.c     |  31 +-
> >  include/drm/drmP.h                      |   7 +
> >  13 files changed, 1467 insertions(+), 290 deletions(-)
> > 
> > -- 
> > 1.8.4
> > 
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
