Re: [PATCH] RFC drm/i915: Stop the machine whilst capturing the GPU crash dump

On Tue, Oct 13, 2015 at 03:52:08PM +0200, Daniel Vetter wrote:
> On Tue, Oct 13, 2015 at 01:24:53PM +0100, Chris Wilson wrote:
> > On Tue, Oct 13, 2015 at 02:09:59PM +0200, Daniel Vetter wrote:
> > > On Fri, Oct 09, 2015 at 06:55:23PM +0100, Chris Wilson wrote:
> > > > On Fri, Oct 09, 2015 at 07:33:23PM +0200, Daniel Vetter wrote:
> > > > > On Fri, Oct 09, 2015 at 01:21:45PM +0100, Chris Wilson wrote:
> > > > > > The error state is purposefully racy as we expect it to be called at any
> > > > > > time and so have avoided any locking whilst capturing the crash dump.
> > > > > > However, with multi-engine GPUs and multiple CPUs, those races can
> > > > > > manifest into OOPSes as we attempt to chase dangling pointers freed on
> > > > > > other CPUs. Under discussion are lots of ways to slow down normal
> > > > > > operation in order to protect the post-mortem error capture, but what if
> > > > > > we take the opposite approach and freeze the machine whilst the error
> > > > > > capture runs (note the GPU may still be running, but as long as we don't
> > > > > > process any of the results the driver's bookkeeping will be static).
> > > > > > 
> > > > > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > > > 
> > > > > One risk I see is that the list walking might still go off the rails when
> > > > > we stop the machine right in the middle of a list_move. With that we might
> > > > > start scanning the active list (of objects) on one engine and then midway
> > > > > through get to another engine and so never see the list_head again,
> > > > > looping forever. No idea yet what to do with that.
> > > > 
> > > > A move is a del followed by an insertion; you cannot step into an entry
> > > > that is in the middle of moving between lists - don't forget that
> > > > stop_machine() is a very heavy memory barrier. Similarly, the list_add()
> > > > should ensure we can't step forward into an element that will not lead
> > > > back to the list. So I am not convinced that it would be susceptible to
> > > > that style of hijacking.
> > > 
> > > The compiler could wreak havoc, so I think we need at least somewhat
> > > ordered list updates. Using the rcu list primitives but stop_machine
> > > instead of kfree_rcu might do the trick.
> > 
> > I'd take the compiler barriers, but I don't want the mb() inside every
> > list update. With just a compiler barrier, only ever walking the lists
> > forwards in the error capture, and the error capture being inside a
> > stop_machine (so a full mb() and no concurrent access), it is safe.
> > (Quite a list of brittle caveats.)
> 
> Yeah, hence using _rcu list macros. They have the relevant barriers
> already and should work. The only difference is that instead of
> synchronize_rcu on the write side before kfree, we'll use stop_machine on
> the read side. It's still RCU, but with all the cost moved to the read
> side while still keeping the benefit that the read side can be done
> locklessly.

They imply an mb() on every write, not just a barrier(), and we do a fair
few list updates on each buffer.
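
To make the read side above concrete, here is a sketch only - the types and
helper names are stand-ins, not the real i915 bookkeeping: a forward-only
walk using READ_ONCE() so the compiler cannot refetch or reorder the pointer
loads, run from inside the stop_machine() callback where no other CPU can
touch the lists.

#include <linux/list.h>
#include <linux/compiler.h>

/* Stand-in types: not the real i915 structures. */
struct capture_state;
struct active_object {
	struct list_head link;
	/* ... */
};

/* Hypothetical helper that records one object into the dump. */
static void capture_object(struct capture_state *error,
			   struct active_object *obj);

static void capture_active_list(struct list_head *head,
				struct capture_state *error)
{
	/*
	 * READ_ONCE() stops the compiler refetching ->next; stop_machine()
	 * guarantees no other CPU is updating the list while we walk it.
	 */
	struct list_head *pos = READ_ONCE(head->next);

	while (pos != head) {
		struct active_object *obj =
			list_entry(pos, struct active_object, link);

		capture_object(error, obj);
		pos = READ_ONCE(pos->next);
	}
}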

> > > > The only alternative I see to list walking is not to do any from the
> > > > error capture and to rely on attaching enough information to the request
> > > > (along with register state) to be able to do postmortems.
> > > 
> > > That still means we need to at least protect the request lists to get at
> > > said request. And it sounded like you wouldn't like a kfree_rcu in there
> > > that much.
> > 
> > The burden has to be on the error capture side, as having to do any atomic
> > operations whilst processing the requests quickly shows up in the profiles
> > (at the moment, here, those profiles are dominated by the memory accesses
> > required to update the lists, whereas once those accesses were dwarfed by
> > the locked operations), so I don't relish the prospect of adding atomic
> > operations around list walking in the normal case.
> 
> Yeah, spin_lock_irq would be a horror, and that's the only other solid
> plan we really have. One caveat of stop_machine is that we can only use it
> in the error capture, not in the hangcheck itself. But at most we'd need
> to RCU the requests properly, and using an engine-local buffer (to avoid
> the risk of jumping off the rails onto another list) for that would fully
> mitigate any rcu costs for freeing. But I didn't check the code to see
> whether we even need that ;-)

So far we have successfully devised strategies for keeping hangcheck nice
and racy; let's keep believing we can do so in the future.
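
For reference, a minimal sketch of the stop_machine() shape being discussed
above (function names illustrative, not lifted from the actual patch):

#include <linux/stop_machine.h>

struct drm_i915_private;

/* Hypothetical capture body that does the lockless list walks. */
static void do_capture_error_state(struct drm_i915_private *dev_priv);

static int __capture_stopped(void *data)
{
	/*
	 * Runs with every other CPU parked in its stopper thread with
	 * interrupts off, so the driver's bookkeeping is static for the
	 * duration of the dump.
	 */
	do_capture_error_state(data);
	return 0;
}

static void capture_error_state(struct drm_i915_private *dev_priv)
{
	/* NULL cpumask: run on one CPU while all the others spin. */
	stop_machine(__capture_stopped, dev_priv, NULL);
}
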
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx



