On Sun, Feb 24, 2013 at 01:24:59PM -0800, Ben Widawsky wrote: > On Sun, Feb 24, 2013 at 09:34:30AM +0000, Chris Wilson wrote: > > On Sat, Feb 23, 2013 at 05:30:10PM -0800, Ben Widawsky wrote: > > > On error, this represents the state of the currently running context at > > > the time it was loaded. > > > > > > Unfortunately, since we're hung and can't switch out the context this > > > may not tell us too much about the most current state of the context, > > > but does give clues about what has happened since loading. > > > > > > Thanks to recent doc updates, we have a little more confidence regarding > > > what is actually in this memory, and perhaps it will help us gain more > > > insight into certain bugs. AFAICT, the most interesting info is in the > > > first page. To save space, we only capture the first page. In the > > > future, we might want to dump more. > > > > > > Sample of the relevant part of error state: > > > --- HW Context = 0x01b20000 > > > 00000000 : 00000000 1100105f 00002028 ffff0880 > > > 00000010 : 0000209c feff4040 000020c0 efdf0080 > > > 00000020 : 00002178 00000001 0000217c 00145855 > > > 00000030 : 00002310 00000000 00002314 00000000 > > > 00000040 : 00002318 00000000 0000231c 00000000 > > > > Presentation looks reasonable, except it will confuse > > intel_error_decode as it will match "%x : %x". How about > > "[%03x] %08x %08x %08x %08x"? > > Will %4x be okay? I suspect 1 page may not be sufficient at some point. > Also, it's probably worth modifying intel_error_decode since I suspect > over time we'll want to dump more objects. %04x is fine. I was simply trimming a few zeroes for brevity. Using [] was the key to making it distinct enough for intel_error_decode was still being comprehensible for us. > > > References: https://bugs.freedesktop.org/show_bug.cgi?id=55845 > > > Cc: Chris Wilson <chris at chris-wilson.co.uk> > > > Signed-off-by: Ben Widawsky <ben at bwidawsk.net> > > > --- > > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > > > index e95337c..ab88620 100644 > > > --- a/drivers/gpu/drm/i915/i915_drv.h > > > +++ b/drivers/gpu/drm/i915/i915_drv.h > > > @@ -209,6 +209,7 @@ struct drm_i915_error_state { > > > u32 pgtbl_er; > > > u32 ier; > > > u32 ccid; > > > + struct drm_i915_error_object *ctx_obj; > > > > Put it next to the other pointers; lest we want to start digging holes. > > On that request, I am going to make 1 per ring then. We don't have any > non-ring error objects currently. I've also always been in favor of > having the code more ring independent but that was removed because of > Daniel. Yes. I also think of contexts as being a far more generic constructs than just the current automagic hw save/restore. > > > u32 derrmr; > > > u32 forcewake; > > > bool waiting[I915_NUM_RINGS]; > > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > > > index ebaf558..7f7d241 100644 > > > --- a/drivers/gpu/drm/i915/i915_irq.c > > > +++ b/drivers/gpu/drm/i915/i915_irq.c > > > @@ -1321,6 +1321,14 @@ static void i915_capture_error_state(struct drm_device *dev) > > > error->pgtbl_er = I915_READ(PGTBL_ER); > > > error->ccid = I915_READ(CCID); > > > > > > + if (error->ccid && !dev_priv->hw_contexts_disabled) { > > > + list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list) > > > > I am doubtful that the active list will hold the object in all cases, as > > we only put the context obj onto the active list when switching away. > > I'd check the gtt_list to be on the safe side. > > Hmm, that shouldn't have been how the code worked when I wrote it, and > I'm too lazy to look if it's changed. The context should be made active > on the switch to it. If we use rc6, and don't do that, that's really bad > because rc6 will automatically save/restore the context (though as long > as it remained pinned we're safe). In any case, gtt_list is fine with > me, though I still believe if it's not on the active list, it's an > error. Hmm, the code we ended up with only did move_to_active() when switching away in order for it to remain pinned until after the MI_SET_CONTEXT had been processed. Before then it is kept alive through explicit pinning and referencing from the ring. > > And ignore what we think of hw_context_disabled - if the CCID randomly > > points to one of our objects, lets attach it. > > I can switch to HAS_HW_CONTEXTS for this. I'm really hoping we never see > CCID != 0 when not using contexts though. Complete agree. I also dread the day we see random CCID values. But stanger things have happened. -Chris -- Chris Wilson, Intel Open Source Technology Centre