On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote: > On Sun, 1 Jul 2012 12:41:19 +0200 > Daniel Vetter <daniel at ffwll.ch> wrote: > > > On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben at bwidawsk.net> wrote: > > > On Tue, 26 Jun 2012 23:08:50 +0200 > > > Daniel Vetter <daniel.vetter at ffwll.ch> wrote: > > > > > >> Having this early check in intel_ring_begin doesn't buy us anything, > > >> since we'll be calling into wait_request in the usual case already > > >> anyway. In the corner case of not waiting for free space using the > > >> last_retired_head we simply need to do the same check, too. > > >> > > >> With these changes we'll only ever get an -EIO from intel_ring_begin > > >> if the gpu has truely been declared dead. > > >> > > >> v2: Don't conflate the change to prevent intel_ring_begin from returning > > >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is > > >> just prep work to avoid returning -EAGAIN in callsites that can't handle > > >> syscall restarting. > > > > > > I'm really scared by this change. It's diffuclt to review because there > > > are SO many callers of intel_ring_begin, and figuring out if they all > > > work in the wedged case is simply too difficult. I've yet to review the > > > rest of the series, but it may make more sense to put this change last > > > perhaps? > > > > Well, this patch doesn't really affect much what error codes the > > callers get - we'll still throw both -EGAIN and -EIO at them (later on > > patches will fix this). > > > > What this patch does though is prevent us from returning too many > > -EIO. Imagine the gpu died and we've noticed already (hence > > dev_priv->mm.wedged is set), but some process is stuck churning > > through the execbuf ioctl, holding dev->struct_mutex. While processing > > the execbuf we call intel_ring_begin to emit a few commands. Now > > usually, even when the gpu is dead, there is enough space in the ring > > to do so, which allows us to complete the execbuf ioctl and then later > > on we can block properly when trying to grab the mutex in the next > > ioctl for the gpu reset work handler to complete. > > That in itself is a pretty big change, don't you think? It seems rather > strange and dangerous to modify HW (which we will if we allow execbuf to > continue when we write the tail pointer). I think we want some way to > not write the tail ptr in such a case. Now you might respond, well, who > cares about writes? But this imposes a pretty large restriction on any > code that can't work well after the GPU is hung. > > I see the irony. I'm complaining that you can make GPU hangs unstable, > and the patch series is fixing GPU reset. Call it paranoia, it still > seems unsafe to me, and makes us have to think a bit more whenever > adding any code. Am I over-thinking this? I think so. It takes us a few seconds before we notice that the gpu died - only when the hangcheck timer expired at least twice we'll declare the hw wedged. Imo a few seconds is plenty of time to sneak in a few ringbuffer emissions (and hence tail pointer writes) ;-) So I think trying to avoid touching the gpu for another few usecs _is_ overthinking the problem, because fundamentally we can't avoid touching a dead gpu - we will only ever notice its demise after the fact. Yours, Daniel -- Daniel Vetter Mail: daniel at ffwll.ch Mobile: +41 (0)79 365 57 48