Re: [Intel-gfx] [PATCH] drm/i915: Fix EIO/wedged handling in gem fault handler

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Thu, 4 Sep 2014 08:22:39 +0100

On Thu, Sep 04, 2014 at 08:17:42AM +0100, Chris Wilson wrote:
> On Thu, Sep 04, 2014 at 09:11:23AM +0200, Daniel Vetter wrote:
> > In
> > 
> > commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c
> > Author: Daniel Vetter <daniel.vetter@xxxxxxxx>
> > Date:   Thu Nov 15 17:17:22 2012 +0100
> > 
> >     drm/i915: clear up wedged transitions
> > 
> > I've accidentally inverted the EIO/wedged handling in the fault
> > handler: We want to return the EIO as a SIGBUS only if it's not
> > because of the gpu having died, to prevent userspace from unduly
> > dying.
> > 
> > In my defence the comment right above is completely misleading, so fix
> > both.
> > 
> > v2: Drop the WARN_ON, it's not actually a bug to e.g. receive an -EIO
> > when swap-in fails.
> 
> You mean drop the WARN_ON and the if...
>  
> > Reported-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 6c685708a516..247ad18b8c3e 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -1596,9 +1596,12 @@ unlock:
> >  out:
> >  	switch (ret) {
> >  	case -EIO:
> > -		/* If this -EIO is due to a gpu hang, give the reset code a
> > -		 * chance to clean up the mess. Otherwise return the proper
> > -		 * SIGBUS. */
> > +		/*
> > +		 * We eat errors when the gpu is terminally wedged to avoid
> > +		 * userspace unduly crashing (gl has no provisions for mmaps to
> > +		 * fail). But any other -EIO isn't ours (e.g. swap in failure)
> > +		 * and so needs to be reported.
> > +		 */
> >  		if (i915_terminally_wedged(&dev_priv->gpu_error)) {
> 
> Because you didn't actually make the change you set out to do :)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 377d619827bb..b11fbe9332d5 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1541,23 +1541,21 @@ unlock:
        mutex_unlock(&dev->struct_mutex);
 out:
        switch (ret) {
-       case -EIO:
-               /* If this -EIO is due to a gpu hang, give the reset code a
-                * chance to clean up the mess. Otherwise return the proper
-                * SIGBUS. */
-               if (!i915_terminally_wedged(&dev_priv->gpu_error)) {
-                       ret = VM_FAULT_SIGBUS;
-                       break;
-               }
+       case 0:
+               /*
+                * Success. You may proceed.
+                */
        case -EAGAIN:
                /*
                 * EAGAIN means the gpu is hung and we'll wait for the error
                 * handler to reset everything when re-faulting in
                 * i915_mutex_lock_interruptible.
                 */
-       case 0:
        case -ERESTARTSYS:
        case -EINTR:
+               /* Signal interruptus. Return to userspace and await
+                * further instructions.
+                */
        case -EBUSY:
                /*
                 * EBUSY is ok: this just means that another thread
@@ -1568,6 +1566,20 @@ out:
        case -ENOMEM:
                ret = VM_FAULT_OOM;
                break;
+       case -EIO:
+               /* The driver should never report an EIO due to GPU hangs,
+                * all requests shold have been reset and the object should
+                * be idle. There should be no impediment for us to mmap the
+                * object in those cases. It would be nice if we could throw
+                * a warning in such cases, but...
+                *
+                * EIO can also arise from a failure to swap in pages,
+                * for example. Here, we must report SIGBUS as there is
+                * no recovery for memory corruption.
+                *
+                * And since we can have a wedged GPU and swap failure,
+                * having a WARN here could be misleading.
+                */
        case -ENOSPC:
        case -EFAULT:
                ret = VM_FAULT_SIGBUS;


-- 
Chris Wilson, Intel Open Source Technology Centre
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html