Re: [Intel-gfx] i915 PSR test results and cursor lag

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Feb 05, 2018 at 11:54:04PM +0000, Andy Lutomirski wrote:
> 
> 
> > On Feb 5, 2018, at 2:50 PM, Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:
> > 
> >> On Sat, Feb 03, 2018 at 05:33:08PM +0000, Andy Lutomirski wrote:
> >>> On Fri, Feb 2, 2018 at 7:18 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >>>> On Fri, Feb 2, 2018 at 1:24 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >>>>> On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>> Quoting Andy Lutomirski (2018-02-01 21:04:30)
> >>>>>> I got this after a recent suspend/resume:
> >>>>>> 
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed.
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan all dirs
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]:   device-enumerator:
> >>>>>> scanning /sys/bus
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]:   device-enumerator:
> >>>>>> scanning /sys/class
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open
> >>>>>> configuration file '/etc/systemd/sleep.conf': No such file or
> >>>>>> directory
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending...
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal
> >>>>>> sender=n/a destination=n/a object=/org/freedesktop/login1
> >>>>>> interface=org.freedesktop.login1.Manager member=PrepareForSleep
> >>>>>> cookie=570 reply
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message
> >>>>>> type=method_call sender=:1.46 destination=:1.1
> >>>>>> object=/org/freedesktop/login1/session/_32
> >>>>>> interface=org.freedesktop.login1.Session member=ReleaseDevice
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal
> >>>>>> sender=n/a destination=:1.46
> >>>>>> object=/org/freedesktop/login1/session/_32
> >>>>>> interface=org.freedesktop.login1.Session member=PauseDevice cookie
> >>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane
> >>>>>> transform 0: Permission denied
> >>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed
> >>>>>> with (Permission denied), drawing cursor with OpenGL from now on
> >>>>>> 
> >>>>>> But I don't see the word "cursor" in my system logs before the first
> >>>>>> suspend.  What am I looking for?  This is Fedora 27 running a Gnome
> >>>>>> Wayland session, but it hasn't been reinstalled in some time, so it's
> >>>>>> possible that there are some weird settings sitting around.  But I did
> >>>>>> check and I have no weird i915 parameters.
> >>>>> 
> >>>>> You are using gnome-shell as the display server. From that it appears to
> >>>>> have started off with a HW cursor and switched to a SW cursor after
> >>>>> suspend. Did you notice a change in behaviour? After rebooting or just
> >>>>> restarting gnome-shell?
> >>>> 
> >>>> I think it's less consistently bad after a reboot before suspending.
> >>>> 
> >>>>> 
> >>>>>> Also, are these things potentially related:
> >>>>>> 
> >>>>>> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential
> >>>>>> atomic update failure on pipe A
> >>>>> 
> >>>>> They are just "missed the immediate vblank for the screen update"
> >>>>> messages. Should not be related to PSR, but may cause jitter by delaying
> >>>>> the odd screen update.
> >>>> 
> >>>> I just got this one, and the timestamp is at least reasonably close to
> >>>> a giant latency spike:
> >>>> 
> >>>> [  288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic
> >>>> update failure on pipe A (start=31 end=32) time 15 us, min 1073, max
> >>>> 1079, scanline start 1087, end 1088
> >>>> 
> >>>>> 
> >>>>>> As I'm typing this, I've seen a couple instances of what seems like a
> >>>>>> full *second* of cursor latency, but I've only gotten the potential
> >>>>>> atomic update failure once.
> >>>>>> 
> >>>>>> And is there any straightforward tracing to do to distinguish between
> >>>>>> PSR exit latency and other potential sources of latency?
> >>>>> 
> >>>>> It looks plausible that we could at least report how long it takes the
> >>>>> registers to reflect the change in state (but we don't). The best source
> >>>>> of information atm is /sys/kernel/debug/dri/0/i915_edp_psr_status.
> >>>> 
> >>>> Hmm.
> >>>> 
> >>>> I went and looked at the code, and I noticed what could be bugs or
> >>>> could (more likely) be my confusion since I don't know this code at
> >>>> all:
> >>>> 
> >>>> intel_single_frame_update() does something inscrutable to me, but I
> >>>> imagine it does something that causes the next page flip to get
> >>>> noticed by the panel even with PSR on.  But how does the code that
> >>>> calls it know that anything happened?  (Looking at the commit history,
> >>>> maybe this is something special that's only needed on some platforms
> >>>> but doesn't replace the normal PSR exit sequence.)
> >>>> 
> >>>> Perhaps more interestingly, intel_psr_flush() does this:
> >>>> 
> >>>>    /* By definition flush = invalidate + flush */
> >>>>    if (frontbuffer_bits)
> >>>>        intel_psr_exit(dev_priv);
> >>>> 
> >>>>    if (!dev_priv->psr.active && !dev_priv->psr.busy_frontbuffer_bits)
> >>>>        if (!work_busy(&dev_priv->psr.work.work))
> >>>>            schedule_delayed_work(&dev_priv->psr.work,
> >>>>                          msecs_to_jiffies(100));
> >>>> 
> >>>> I'm guessing that the idea is that we're turning off PSR because we
> >>>> want the panel to update and we expect that, in 100ms, the update will
> >>>> have hit the panel and we'll have been idle long enough for it to make
> >>>> sense to re-enter PSR.  IOW, the code wants PSR to be off for at least
> >>>> 100ms and then to turn back on.  But this code actually says "turn PSR
> >>>> back on in at *most* 100ms".  What happens if there are two screen
> >>>> updates 99ms apart?  The first one should work fine, but the next one
> >>>> will hit with 1ms left on the delayed work, and intel_psr_work() will
> >>>> get called in 1ms.  There's some magic with busy_frontbuffer_bits, but
> >>>> it seems questionable to me that intel_psr_flush() clears
> >>>> busy_frontbuffer_bits and *then* calls intel_psr_exit().
> >>>> 
> >>>> Naively, I would expect that PSR needs to be kept off until the vblank
> >>>> following the page flip.
> >>>> 
> >>>> Also, in intel_psr_work(), shouldn't this code:
> >>>> 
> >>>>    /*
> >>>>     * The delayed work can race with an invalidate hence we need to
> >>>>     * recheck. Since psr_flush first clears this and then reschedules we
> >>>>     * won't ever miss a flush when bailing out here.
> >>>>     */
> >>>>    if (dev_priv->psr.busy_frontbuffer_bits)
> >>>>        goto unlock;
> >>>> 
> >>>> re-arm the delayed work?
> >>>> 
> >>>> Anyway, this is all on a 4.14 kernel.  I should update to 4.16 and see
> >>>> what happens.
> >>> 
> >>> I updated to 4.15, and the situation is much worse.  With
> >>> enable_psr=1, the system survives for several seconds and then the
> >>> screen stops updating entirely.  If I boot with i915.enable_psr=1, I
> >>> get to the Fedora login screen and then the system dies.  If I set
> >>> enable_psr=1 using sysfs, it does a bit after the next resume.  It
> >>> seems like it also sometimes hangs even worse a bit after the screen
> >>> stops updating, but it's hard to tell.
> >>> 
> >>> I see this in my logs:
> >>> 
> >>> [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR*
> >>> [CRTC:37:pipe A] flip_done timed out
> >>> 
> >>> Sometimes I see this a bit later:
> >>> 
> >>> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR*
> >>> [CRTC:37:pipe A] flip_done timed out
> >>> 
> >>> I'm able to get some debugging out before the system dies.  I see
> >>> intel_psr_flush() getting called a bunch, and I don't see
> >>> intel_psr_invalidate() being called at all.  I also see
> >>> intel_psr_work() activating psr as little as 2ms after
> >>> intel_psr_flush() finishes.  So I think the code is indeed buggy or at
> >>> least questionable.  I'd try to fix it (at least as well as I can
> >>> without knowing anything about how the PSR state machine actually
> >>> works), but the fact that the system hangs would make it very hard to
> >>> test.
> >> 
> >> I filed https://bugs.freedesktop.org/show_bug.cgi?id=104931 for the
> >> questionable reactivation logic issue.  If nothing else, I doubt that
> >> the reactivation timeout logic does what its author intended for it to
> >> do.
> > 
> > Thanks for that. I will check the logic and your patch here.
> > 
> > One thing to consider is that when activating PSR HW engine will
> > consider the number of static frames before transitioning it to
> > active state. So, in general the immediate return to active status
> > shouldn't be a problem.
> > At least not for core platforms. maybe for vlv/chv.
> > 
> 
> Hmm.  Is there some register that can be poked to tell the hardware "hey, I updated the image even if your dirty tracking didn't notice, so turn off PSR if it's on and reset your idle frame counter"?  If so, the logic could be dramatically simplified.

that would be my dream ;)

But the only thing we can do is to disable psr and re-enable it. So the frame counter will restart.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux