Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer <michel@xxxxxxxxxxx> wrote:
On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>> <mario.kleiner.de@xxxxxxxxx> wrote:
>>>
>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>> events at vsartup for DCN")' introduces a new way of pageflip
>>> completion handling for DCN, and some trouble.
>>>
>>> The current implementation introduces a race condition, which
>>> can cause pageflip completion events to be sent out one vblank
>>> too early, thereby confusing userspace and causing flicker:
>>>
>>> prepare_flip_isr():
>>>
>>> 1. Pageflip programming takes the ddev->event_lock.
>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>> 3. Releases ddev->event_lock.
>>>
>>> --> Deadline for surface address regs double-buffering passes on
>>>      target pipe.
>>>
>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>     into hw, but too late for current vblank.
>>>
>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>     in current vblank due to missing the double-buffering deadline
>>>     by a tiny bit.
>>>
>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>     dm_dcn_crtc_high_irq() gets called.
>>>
>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>     pageflip has been completed/will complete in this vblank and
>>>     sends out pageflip completion event to userspace and resets
>>>     pflip_status = AMDGPU_FLIP_NONE.
>>>
>>> => Flip completion event sent out one vblank too early.
>>>
>>> This behaviour has been observed during my testing with measurement
>>> hardware a couple of time.
>>>
>>> The commit message says that the extra flip event code was added to
>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>> this clock gating may happen if no planes are active. According to
>>> Nicholas, the clock gating can also happen if psr is active, and the
>>> gating is controlled independently by the hardware, so difficult to
>>> detect if and when the completion code in above commit is needed.
>>>
>>> This patch tries the following solution: It only executes the extra
>>> pflip
>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>> that there aren't any surface updated pending in the double-buffered
>>> surface scanout address registers. Otherwise it leaves pflip completion
>>> to the pflip irq handler, for a more race-free experience.
>>>
>>> This would only guard against the order of events mentioned above.
>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>>> at all, because 1-3 + 5 might happen even without the hw being
>>> programmed
>>> at all, ie. no surface update pending because none yet programmed
>>> into hw.
>>>
>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>>> under event_lock protection within the same critical section.
>>>
>>> v2: Take Nicholas comments into account, try a different solution.
>>>
>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>>> Seems to work without causing obvious new trouble.
>>
>> Nick, any comments on this?  Can we get this committed or do you think
>> it needs additional rework?
>>
>> Thanks,
>>
>> Alex
>
> Hi Alex, Mario,
>
> This might be a little strange, but if we want to get this in as a fix
> for regressions caused by the original vblank and user events at
> vstartup patch then I'm actually going to give my reviewed by on the
> *v1* of this patch (but not this v2):
>
> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@xxxxxxx>
>
> You can feel free to apply that one.
>
> Reason 1: After having thought about it some more I don't think we
> enable anything today that has hubp powered down at the same time we
> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> entry. Static screen interrupt should happen after that flip finishes I
> think.
>
> The CRTC can still be powered on with zero planes, and I don't think any
> userspace explicitly asks for vblank events in this case but it doesn't
> hurt to have the check.
>

Ok, so the original commit that causes the races currently solves a non-existing - but potential future - problem?
 
I guess then my v1 patch makes sense for the moment and fixes the immediate problem for Linux 5.6-rc.

Maybe there's a way to ask the hw if hubp is clock-gated and so if that code needs to run or not in the future?

As mentioned before, it would be helpful at least for me to get a better overview about which hw events happen when in vblank, which irq's fire in response etc., how this relates to things like power management actions, psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years, and my guideline are still the public pdf files with the DCE register specs for DCE-1'ish hw from sometime around the year 2007.

> Reason 2: This new patch will need much more thorough testing from side
> to fully understand the consequences of locking the entire DC commit
> sequence. For just a page flip that sounds fine, but for anything more
> than (eg. full updates, modesets, etc) I don't think we want to be
> disabling interrupts for potentially many milliseconds.

Ah! I was wondering where the attached splat comes from, but I think
this explains it: With this patch amdgpu_dm_commit_planes keeps the
pcrtc->dev->event_lock spinlock locked while calling
dc_commit_updates_for_stream, which ends up calling
smu_set_display_count, which tries to lock a mutex.


Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and that function only gets called on Navi /DCN2 afaics. All my testing is limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm.

thanks,
-mario

--
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux