[PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

daniel at ffwll.ch (Daniel Vetter) · Mon, 2 Jul 2012 18:47:48 +0200

On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote:
> On Sun, 1 Jul 2012 12:41:19 +0200
> Daniel Vetter <daniel at ffwll.ch> wrote:
> 
> > On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben at bwidawsk.net> wrote:
> > > On Tue, 26 Jun 2012 23:08:50 +0200
> > > Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> > >
> > >> Having this early check in intel_ring_begin doesn't buy us anything,
> > >> since we'll be calling into wait_request in the usual case already
> > >> anyway. In the corner case of not waiting for free space using the
> > >> last_retired_head we simply need to do the same check, too.
> > >>
> > >> With these changes we'll only ever get an -EIO from intel_ring_begin
> > >> if the gpu has truely been declared dead.
> > >>
> > >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> > >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> > >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> > >> syscall restarting.
> > >
> > > I'm really scared by this change. It's diffuclt to review because there
> > > are SO many callers of intel_ring_begin, and figuring out if they all
> > > work in the wedged case is simply too difficult. I've yet to review the
> > > rest of the series, but it may make more sense to put this change last
> > > perhaps?
> > 
> > Well, this patch doesn't really affect much what error codes the
> > callers get - we'll still throw both -EGAIN and -EIO at them (later on
> > patches will fix this).
> > 
> > What this patch does though is prevent us from returning too many
> > -EIO. Imagine the gpu died and we've noticed already (hence
> > dev_priv->mm.wedged is set), but some process is stuck churning
> > through the execbuf ioctl, holding dev->struct_mutex. While processing
> > the execbuf we call intel_ring_begin to emit a few commands. Now
> > usually, even when the gpu is dead, there is enough space in the ring
> > to do so, which allows us to complete the execbuf ioctl and then later
> > on we can block properly when trying to grab the mutex in the next
> > ioctl for the gpu reset work handler to complete.
> 
> That in itself is a pretty big change, don't you think? It seems rather
> strange and dangerous to modify HW (which we will if we allow execbuf to
> continue when we write the tail pointer). I think we want some way to
> not write the tail ptr in such a case. Now you might respond, well, who
> cares about writes? But this imposes a pretty large restriction on any
> code that can't work well after the GPU is hung.
> 
> I see the irony. I'm complaining that you can make GPU hangs unstable,
> and the patch series is fixing GPU reset. Call it paranoia, it still
> seems unsafe to me, and makes us have to think a bit more whenever
> adding any code. Am I over-thinking this?

I think so. It takes us a few seconds before we notice that the gpu died -
only when the hangcheck timer expired at least twice we'll declare the hw
wedged. Imo a few seconds is plenty of time to sneak in a few ringbuffer
emissions (and hence tail pointer writes) ;-)

So I think trying to avoid touching the gpu for another few usecs _is_
overthinking the problem, because fundamentally we can't avoid touching a
dead gpu - we will only ever notice its demise after the fact.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48