Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)

Alex Deucher <alexdeucher@xxxxxxxxx> · Tue, 5 May 2020 13:03:54 -0400

Mario or Nick any thoughts?

Alex

On Mon, May 4, 2020 at 1:35 PM Matt Coffin <mcoffin13@xxxxxxxxx> wrote:
>
> Hey guys,
>
> This is still an issue for me, and I'm still having to run a patch to
> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
> setups in 5.7, is there any news on this? Has anyone else at the very
> least been able to reproduce the problem?
>
> It happens for me in every single program that mesa allows to utilize
> variable refresh rates, and reverting it "fixes" the issue.
>
> Cheers, and sorry for the extra email, just making sure this is still on
> someone's radar,
> Matt
>
> On 4/14/20 5:32 PM, Matt Coffin wrote:
> > Hey everyone,
> >
> > This patch broke variable refresh rate in games (all that I've tried so
> > far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
> > well as a simple freesync tester application.
> >
> > FreeSync tester I've been using: https://github.com/Nixola/VRRTest
> >
> > I'm not at all familiar with the page flipping code, so it would take me
> > a long time to find the *right* way to fix it, but does someone else see
> > why it would do that?
> >
> > The symptom is that the refresh rate of the display constantly bounces
> > between the two ends of the FreeSync range (for me 40 -> 144), and the
> > game stutters like a madman.
> >
> > Any help on where to start, ideas on how to fix it (other than just
> > revert this commit, which I've done in the interim), or alternative
> > patches would be appreciated.
> >
> > Thanks in advance for the work/help,
> > Matt
> >
> > On 3/13/20 8:42 AM, Michel Dänzer wrote:
> >> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> >>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
> >>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
> >>>> <mario.kleiner.de@xxxxxxxxx> wrote:
> >>>>>
> >>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
> >>>>> events at vsartup for DCN")' introduces a new way of pageflip
> >>>>> completion handling for DCN, and some trouble.
> >>>>>
> >>>>> The current implementation introduces a race condition, which
> >>>>> can cause pageflip completion events to be sent out one vblank
> >>>>> too early, thereby confusing userspace and causing flicker:
> >>>>>
> >>>>> prepare_flip_isr():
> >>>>>
> >>>>> 1. Pageflip programming takes the ddev->event_lock.
> >>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
> >>>>> 3. Releases ddev->event_lock.
> >>>>>
> >>>>> --> Deadline for surface address regs double-buffering passes on
> >>>>>      target pipe.
> >>>>>
> >>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
> >>>>>     into hw, but too late for current vblank.
> >>>>>
> >>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
> >>>>>     in current vblank due to missing the double-buffering deadline
> >>>>>     by a tiny bit.
> >>>>>
> >>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
> >>>>>     dm_dcn_crtc_high_irq() gets called.
> >>>>>
> >>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
> >>>>>     pageflip has been completed/will complete in this vblank and
> >>>>>     sends out pageflip completion event to userspace and resets
> >>>>>     pflip_status = AMDGPU_FLIP_NONE.
> >>>>>
> >>>>> => Flip completion event sent out one vblank too early.
> >>>>>
> >>>>> This behaviour has been observed during my testing with measurement
> >>>>> hardware a couple of time.
> >>>>>
> >>>>> The commit message says that the extra flip event code was added to
> >>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
> >>>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
> >>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
> >>>>> this clock gating may happen if no planes are active. According to
> >>>>> Nicholas, the clock gating can also happen if psr is active, and the
> >>>>> gating is controlled independently by the hardware, so difficult to
> >>>>> detect if and when the completion code in above commit is needed.
> >>>>>
> >>>>> This patch tries the following solution: It only executes the extra
> >>>>> pflip
> >>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
> >>>>> that there aren't any surface updated pending in the double-buffered
> >>>>> surface scanout address registers. Otherwise it leaves pflip completion
> >>>>> to the pflip irq handler, for a more race-free experience.
> >>>>>
> >>>>> This would only guard against the order of events mentioned above.
> >>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
> >>>>> at all, because 1-3 + 5 might happen even without the hw being
> >>>>> programmed
> >>>>> at all, ie. no surface update pending because none yet programmed
> >>>>> into hw.
> >>>>>
> >>>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
> >>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
> >>>>> under event_lock protection within the same critical section.
> >>>>>
> >>>>> v2: Take Nicholas comments into account, try a different solution.
> >>>>>
> >>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
> >>>>> Seems to work without causing obvious new trouble.
> >>>>
> >>>> Nick, any comments on this?  Can we get this committed or do you think
> >>>> it needs additional rework?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Alex
> >>>
> >>> Hi Alex, Mario,
> >>>
> >>> This might be a little strange, but if we want to get this in as a fix
> >>> for regressions caused by the original vblank and user events at
> >>> vstartup patch then I'm actually going to give my reviewed by on the
> >>> *v1* of this patch (but not this v2):
> >>>
> >>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@xxxxxxx>
> >>>
> >>> You can feel free to apply that one.
> >>>
> >>> Reason 1: After having thought about it some more I don't think we
> >>> enable anything today that has hubp powered down at the same time we
> >>> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> >>> entry. Static screen interrupt should happen after that flip finishes I
> >>> think.
> >>>
> >>> The CRTC can still be powered on with zero planes, and I don't think any
> >>> userspace explicitly asks for vblank events in this case but it doesn't
> >>> hurt to have the check.
> >>>
> >>> Reason 2: This new patch will need much more thorough testing from side
> >>> to fully understand the consequences of locking the entire DC commit
> >>> sequence. For just a page flip that sounds fine, but for anything more
> >>> than (eg. full updates, modesets, etc) I don't think we want to be
> >>> disabling interrupts for potentially many milliseconds.
> >>
> >> Ah! I was wondering where the attached splat comes from, but I think
> >> this explains it: With this patch amdgpu_dm_commit_planes keeps the
> >> pcrtc->dev->event_lock spinlock locked while calling
> >> dc_commit_updates_for_stream, which ends up calling
> >> smu_set_display_count, which tries to lock a mutex.
> >>
> >>
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx