Re: Panic after S3 resume and modeset with MST

Lyude Paul <lyude@xxxxxxxxxx> · Thu, 30 Mar 2017 14:07:42 -0400



On Thu, 2017-03-30 at 07:55 +0200, Takashi Iwai wrote:
> On Thu, 30 Mar 2017 02:24:37 +0200,
> Lyude Paul wrote:
> > 
> > On Wed, 2017-03-29 at 15:54 +0200, Takashi Iwai wrote:
> > > On Wed, 29 Mar 2017 15:34:15 +0200,
> > > Ville Syrjälä wrote:
> > > > 
> > > > On Wed, Mar 29, 2017 at 03:10:09PM +0200, Takashi Iwai wrote:
> > > > > On Mon, 27 Mar 2017 18:02:13 +0200,
> > > > > Takashi Iwai wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > the upstream fix a16b7658f4e0d4aec9bc3e75a5f0cc3f7a3a0422
> > > > > >     drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > > displays
> > > > > > 
> > > > > > seems to trigger a kernel panic when some modeset change
> > > > > > happens after
> > > > > > S3 resume.  The details are found in openSUSE bugzilla,
> > > > > >   https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > > 
> > > > > > In short, the following procedure causes a kernel panic
> > > > > > (supposedly)
> > > > > > almost 100% on Dell Latitude with Skylake with MST DP on
> > > > > > dock:
> > > > > > 
> > > > > > - Boot with a docking station, DP-1 connected.
> > > > > > - Login on X
> > > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --
> > > > > > auto 
> > > > > > --left-of eDP-1
> > > > > >   ==> This changes the mode.
> > > > > > - Suspend ("systemctl suspend" in my case), and close the
> > > > > > lid.
> > > > > > - Remove from the dock (keep the lid closed).
> > > > > > - Open the lid, which resumes automatically.  It works.
> > > > > > - Suspend again.
> > > > > > - Connect to the dock again (keep the lid closed).
> > > > > > - Open the lid, which resumes automatically.  It's still
> > > > > > OK.
> > > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --
> > > > > > auto 
> > > > > > --left-of eDP-1
> > > > > >   ==> Now the kernel feezes.
> > > > > > 
> > > > > > Reverting the commit mentioned above fixes the problem.
> > > > > > 
> > > > > > The problem is present in all versions I tested.  The
> > > > > > reported
> > > > > > kernel
> > > > > > in the Bugzilla is 4.4.x-based one, but the issue is seen
> > > > > > in
> > > > > > 4.11-rc3,
> > > > > > too.  Note that the S3 resume itself works in 4.11-rc3; the
> > > > > > kernel
> > > > > > panic happens when invoking xrandr manually after that.
> > > > > > 
> > > > > > Unfortunately, I couldn't get a kernel panic message, so
> > > > > > far.  kdump
> > > > > > didn't work well in this case by some reason.  There are
> > > > > > some
> > > > > > screenshots taken by the original reporter (could switch VT
> > > > > > beforehand), but I don't know whether it helps.
> > > > > > 
> > > > > > If you have any hints for further debugging, it'd be highly
> > > > > > appreciated.
> > > > > 
> > > > > It seems that the patch below works around the problem.
> > > > > Can anyone enlighten what's going on there?
> > > > > 
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > Takashi
> > > > > 
> > > > > -- 8< --
> > > > > From: Takashi Iwai <tiwai@xxxxxxx>
> > > > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP
> > > > > MST
> > > > > mode change
> > > > > 
> > > > > We've got a bug report showing that Skylake Dell machines
> > > > > with a
> > > > > docking station causes a kernel panic after S3 resume and
> > > > > modeset.
> > > > > The details are found in the openSUSE bugzilla entry
> > > > > below.  The
> > > > > typical test procedure is:
> > > > > 
> > > > > - Laptop is Dell Latitude with eDP (1366x768)
> > > > > - Boot with docking station connected to a DP (1920x1080)
> > > > > - Login, change the mode via
> > > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-
> > > > > of
> > > > > eDP-1
> > > > > - Suspend, and close the lid after the suspend
> > > > >   (or close the lid to trigger the suspend)
> > > > > - Undock while keeping the lid closed.
> > > > > - Open the lid, which triggers the resume;
> > > > >   the machine wakes up well, and X shows up.  No problem, so
> > > > > far.
> > > > > - Suspend again, close the lid.
> > > > > - Dock again while keeping the lid closed.
> > > > > - Open the lid, triggering the resume; this wakes up still
> > > > > fine.
> > > > > - At this moment, run xrandr again to re-setup DP-1
> > > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-
> > > > > of
> > > > > eDP-1
> > > > >   ==> This triggers a hard crash.
> > > > > 
> > > > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > displays").
> > > > > 
> > > > > Basically the commit just shuffles the calls of
> > > > > intel_display_resume()
> > > > > and intel_dp_mst_resume().  So as a workaround, I tried to
> > > > > split
> > > > > intel_dp_mst_resume() call to postpone the suspected code
> > > > > (the
> > > > > invocation of intel_dp_check_mst_status()), then bingo, this
> > > > > cured the
> > > > > problem.
> > > > > 
> > > > > But don't ask me *why* this fixes.  It's still in a cargo-
> > > > > cult
> > > > > state.
> > > > > 
> > > > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume()
> > > > > before
> > > > > resuming displays")
> > > > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > Signed-off-by: Takashi Iwai <tiwai@xxxxxxx>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/i915_drv.c  |  5 ++++-
> > > > >  drivers/gpu/drm/i915/intel_dp.c  | 20 +++++++++++++++++++-
> > > > >  drivers/gpu/drm/i915/intel_drv.h |  3 ++-
> > > > >  3 files changed, 25 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c
> > > > > b/drivers/gpu/drm/i915/i915_drv.c
> > > > > index 1c75402a59c1..62c40090ceed 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > > @@ -1559,6 +1559,7 @@ static int
> > > > > i915_suspend_switcheroo(struct
> > > > > drm_device *dev, pm_message_t state)
> > > > >  static int i915_drm_resume(struct drm_device *dev)
> > > > >  {
> > > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > +	int mst_pending;
> > > > >  	int ret;
> > > > >  
> > > > >  	disable_rpm_wakeref_asserts(dev_priv);
> > > > > @@ -1608,10 +1609,12 @@ static int i915_drm_resume(struct
> > > > > drm_device *dev)
> > > > >  		dev_priv->display.hpd_irq_setup(dev_priv);
> > > > >  	spin_unlock_irq(&dev_priv->irq_lock);
> > > > >  
> > > > > -	intel_dp_mst_resume(dev);
> > > > > +	mst_pending = intel_dp_mst_resume(dev);
> > > > >  
> > > > >  	intel_display_resume(dev);
> > > > >  
> > > > > +	intel_dp_mst_resume_post(dev, mst_pending);
> > > > > +
> > > > >  	drm_kms_helper_poll_enable(dev);
> > > > >  
> > > > >  	/*
> > > > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > > > b/drivers/gpu/drm/i915/intel_dp.c
> > > > > index d1670b8afbf5..fc5ea900e6f3 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > > > @@ -6027,9 +6027,10 @@ void intel_dp_mst_suspend(struct
> > > > > drm_device *dev)
> > > > >  	}
> > > > >  }
> > > > >  
> > > > > -void intel_dp_mst_resume(struct drm_device *dev)
> > > > > +int intel_dp_mst_resume(struct drm_device *dev)
> > > > >  {
> > > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > +	int pending = 0;
> > > > >  	int i;
> > > > >  
> > > > >  	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > > @@ -6041,6 +6042,23 @@ void intel_dp_mst_resume(struct
> > > > > drm_device
> > > > > *dev)
> > > > >  
> > > > >  		ret =
> > > > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > > > >  		if (ret)
> > > > > +			pending |= 1 << i;
> > > > > +	}
> > > > > +
> > > > > +	return pending;
> > > > > +}
> > > > > +
> > > > > +void intel_dp_mst_resume_post(struct drm_device *dev, int
> > > > > pending)
> > > > > +{
> > > > > +	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > +	int i;
> > > > > +
> > > > > +	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > > +		struct intel_digital_port *intel_dig_port =
> > > > > +			dev_priv->hotplug.irq_port[i];
> > > > > +		if (!intel_dig_port || !intel_dig_port-
> > > > > > dp.can_mst)
> > > > > 
> > > > > +			continue;
> > > > > +		if (pending & (1 << i))
> > > > >  			intel_dp_check_mst_status(&intel_dig
> > > > > _por
> > > > > t->dp);
> > > > >  	}
> > > > >  }
> > > > 
> > > > The whole MST resume is a bit of chicken and egg type of
> > > > situation.
> > > > We
> > > > need the HPD interrupts to resume the previous state, but we
> > > > don't
> > > > want
> > > > to actually process real hotplugs until we've done the resume.
> > > > The
> > > > current code is definitely broken IMO.
> > > > 
> > > > But I'm not really sure why this patch fixes things because the
> > > > HPD
> > > > processing that will occur when we talk to the sink during the
> > > > display
> > > > resume should also call intel_dp_check_mst_status().
> > > 
> > > Actually, just dropping intel_dp_check_mst_status() calls in
> > > intel_dp_mst_resume() seems enough to fix the problem.
> > > 
> > > Below is the v2 patch doing that.  Does it make more sense?
> > 
> > Barely, unfortunately :(. I'll be honest I'm a little surprised
> > this
> > patch doesn't break anything, but you do seem to be right in saying
> > that hotplugging re-sets up the displays in the event that things
> > go
> > wrong.
> > 
> > Going through the backtrace and the dmesg that you gave on the
> > bugzilla
> > I can't figure out any definitive answer for why this is crashing.
> > We
> > will have to get dmesg output from when it crashes to understand
> > this
> > I'm pretty sure.
> > 
> > JFYI (I really need to make a guide for this somewhere...) I
> > usually
> > use the netconsole module to get kernel output from machines when
> > they
> > kernel panic. In your case I would set this up right before you do
> > the
> > final `xrandr` call that kills the machine, since netconsole has
> > not
> > been very reliable in my experience over suspend/resume cycles. It
> > might take a few tries to get it to print the whole crash to the
> > netconsole (and you will need to make sure the system has a PCI
> > ethernet adapter, USB or wifi won't do), but this usually works.
> 
> I've already tried netconsole (even before kdump), but unfortunately
> it doesn't work in this case.  As mentioned in Bugzilla, the network
> goes crazy when the machine crashes (even the whole network segment
> the machine connects to is stalling).  It's similar like the symptom
> we've seen with the SKL CPU-state screw up in the past with
> intel_idle
> driver.
Do you think you might be able to get a translation of the stack dump
to the actual lines this issues is occurring on? You should be able to
do this against i915.ko using eu-addr2line (it will have to be against
the same kernel build you're running on the crashing machine, hence why
I can't just do it myself). I might see if I can give you some patches
to spit some info out.

> 
> > If you could try getting a full dmesg with this, that would be
> > super
> > appreciated.
> > 
> > JFYI: in the future you should use git send-email for sending these
> > patches to intel-gfx, since the format you sent it in is preventing
> > your patch from getting noticed by patchwork. Other then that
> > though:
> > 
> > Reviewed-by: Lyude <lyude@xxxxxxxxxx>
> 
> Thanks, it's not meant to be a real submission of the patch, but
> currently it's still a sort of RFC.  If there is no better solution
> insight, I'll submit it as a proper fix patch.
> 
> 
> Takashi
> 
> > > 
> > > 
> > > thanks,
> > > 
> > > Takashi
> > > 
> > > -- 8< --
> > > From: Takashi Iwai <tiwai@xxxxxxx>
> > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST
> > > mode
> > > change
> > >  (v2)
> > > 
> > > We've got a bug report showing that Skylake Dell machines with a
> > > docking station causes a kernel panic after S3 resume and
> > > modeset.
> > > The details are found in the openSUSE bugzilla entry below.  The
> > > typical test procedure is:
> > > 
> > > - Laptop is Dell Latitude with eDP (1366x768)
> > > - Boot with docking station connected to a DP (1920x1080)
> > > - Login, change the mode via
> > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > eDP-1
> > > - Suspend, and close the lid after the suspend
> > >   (or close the lid to trigger the suspend)
> > > - Undock while keeping the lid closed.
> > > - Open the lid, which triggers the resume;
> > >   the machine wakes up well, and X shows up.  No problem, so far.
> > > - Suspend again, close the lid.
> > > - Dock again while keeping the lid closed.
> > > - Open the lid, triggering the resume; this wakes up still fine.
> > > - At this moment, run xrandr again to re-setup DP-1
> > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > eDP-1
> > >   ==> This triggers a hard crash.
> > > 
> > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > displays").
> > > 
> > > This patch tries to work around the crash by ignoring the failed
> > > port
> > > from drm_dp_mst_topology_mgr_resume().  They should be handled in
> > > hpd
> > > later in anyway.
> > > 
> > > v1->v2: just ignore the drm_dp_mst_topology_mgr_resume() error
> > > codes
> > >         instead of postponing.
> > > 
> > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before
> > > resuming displays")
> > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > Signed-off-by: Takashi Iwai <tiwai@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/i915/intel_dp.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > b/drivers/gpu/drm/i915/intel_dp.c
> > > index d1670b8afbf5..a6c0f0ac16eb 100644
> > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > @@ -6041,6 +6041,7 @@ void intel_dp_mst_resume(struct drm_device
> > > *dev)
> > >  
> > >  		ret =
> > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > >  		if (ret)
> > > -			intel_dp_check_mst_status(&intel_dig_por
> > > t-
> > > > dp);
> > > 
> > > +			DRM_DEBUG_KMS("DP MST resume failed for
> > > port-%c\n",
> > > +				      port_name(intel_dig_port-
> > > > port));
> > > 
> > >  	}
> > >  }
> > 
> > -- 
> > Cheers,
> > 	Lyude
> > 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx