Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> · Mon, 11 Mar 2024 09:51:29 -0700

On 3/11/2024 6:43 AM, Johan Hovold wrote:
On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
On 3/8/2024 4:43 AM, Johan Hovold wrote:

For this last remaining reset with the stacktrace you have mentioned
below, I do not think this was introduced due to PM runtime series. We
have had this report earlier with the same stacktrace as well in earlier
kernels which didnt have PM runtime support:

But, it seems like there is another race condition in this code
resulting in this issue again.

I have posted my analysis with the patch here as a RFC y'day:

https://patchwork.freedesktop.org/patch/581758/

I missed CCing you and Bjorn on the RFC but when I post it as a patch
either today/tomm, will CC you both.

Ok, thanks. I'll take a closer look at that next week. It's not clear to
me what that race looks like after reading the commit message. It would
be good to have some more details in there (e.g. the sequence of events
that breaks the state machine) and some way to reproduce the issue
reliably.

I was able to reproduce the reset with some of my debug printks in place
after reapplying the reverted hpd notify change so I have an explanation
for (one of) the ways we can up in this state now.

This does not match description of the problem in the fix linked to
above and I don't think that patch solves the underlying issue even if
it may make the race window somewhat smaller. More details below.

Its the same condition you described below that enable does not go 
through and we bail out as we are in ST_DISCONNECTED.

We now also have Bjorn's call trace from a different Qualcomm platform
triggering an unclocked access on disconnect, something which would
trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

[ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379642] Call trace:

[ 2030.379722]  wait_for_completion_timeout+0x14/0x20
[ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
[ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
[ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
[ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
[ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
[ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
[ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
[ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
[ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
[ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
[ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
[ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
[ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
[ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
[ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
[ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
[ 2030.381241]  process_scheduled_works+0x19c/0x2cc
[ 2030.381249]  worker_thread+0x290/0x3cc
[ 2030.381255]  kthread+0xfc/0x184
[ 2030.381260]  ret_from_fork+0x10/0x20

The above could happen if the convoluted hotplug state machine breaks
down so that the device is runtime suspended before
dp_bridge_atomic_disable() is called.

Yes, state machine got broken and I have explained how in the commit
text of the RFC. But its not necessarily due to PM runtime but a
sequence of events happening from userspace exposing this breakage.

After looking at this some more today, I agree with you. The
observations I've reported are consistent with what would happen if the
link clock is disabled when dp_bridge_atomic_disable() is called.

That clock is disabled in dp_bridge_atomic_post_disable() before runtime
suspending, but perhaps there are some further paths that can end up
disabling it.

Turns out there are paths like that and we hit those more often before
reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify().

Specifically, when a previous connect attempt did not enable the bridge
fully so that it is still in the ST_MAINLINK_READY when we receive a
disconnect event, dp_hpd_unplug_handle() will turn of the link clock.

[  204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
[  204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[  204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link

A racing connect event, such as the one I described earlier, can then
try to enable the bridge again but dp_bridge_atomic_enable() just bails
out early (and leaks a rpm reference) because we're now in
ST_DISCONNECTED:

[  204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
[  204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
[  204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
[  204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
[  204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
[  204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
[  204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)

Clearing link_ready already in dp_display_notify_disconnect() would make
the race window slightly smaller, but it would essentially just paper
over the bug as the events are still not serialised. Notably, there is
no user space interaction involved here and it's the spurious connect
event that triggers the bridge enable.

Yes, it only narrows down the race condition window. The issue can still 
happen if the commit / modeset was issued before we marked link_ready as 
false.

And yes, I was only targetting a short term fix till we rework the HPD. 
That will happen only incrementally as its a delicate piece of code.

When the fbdev hotplug code later disables the never-enabled bridge,
things go boom:

[  204.649072] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[  204.650378] [CONNECTOR:35:DP-2] status updated from connected to disconnected
[  204.651111] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable

as the link clock has already been disabled when accessing the link
registers.

The stack trace for the bridge enable above is:

[  204.553922]  dp_bridge_atomic_enable+0x40/0x2f0 [msm]
[  204.555241]  drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
[  204.556557]  drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
[  204.557853]  msm_atomic_commit_tail+0x204/0x804 [msm]
[  204.559173]  commit_tail+0xa4/0x18c [drm_kms_helper]
[  204.560450]  drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
[  204.561743]  drm_atomic_commit+0xa4/0xdc [drm]
[  204.563065]  drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
[  204.564402]  drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
[  204.565733]  drm_client_modeset_commit+0x30/0x58 [drm]
[  204.567055]  __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
[  204.568381]  drm_fb_helper_hotplug_event.part.0+0xd4/0x110 [drm_kms_helper]
[  204.569708]  drm_fb_helper_hotplug_event+0x38/0x44 [drm_kms_helper]
[  204.571032]  msm_fbdev_client_hotplug+0x28/0xd4 [msm]
[  204.572395]  drm_client_dev_hotplug+0xcc/0x130 [drm]
[  204.573755]  drm_kms_helper_connector_hotplug_event+0x34/0x44 [drm_kms_helper]
[  204.575114]  drm_bridge_connector_hpd_cb+0x90/0xa4 [drm_kms_helper]
[  204.576465]  drm_bridge_hpd_notify+0x40/0x5c [drm]
[  204.577842]  drm_aux_hpd_bridge_notify+0x18/0x28 [aux_hpd_bridge]
[  204.579184]  pmic_glink_altmode_worker+0xc0/0x23c [pmic_glink_altmode]

For some reason, possibly due to unrelated changes in timing, possibly
after the hotplug revert, I am no longer able to reproduce the reset
with 6.8-rc7 on the X13s.

I do not know how the hotplug revert fixed this stack because I think
this can still happen.

So, while it may still be theoretically possible to hit the resets after
the revert, the HPD notify revert effectively "fixed" the regression in
6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
half-initialised bridge).

Not entirely. In the bug which was reported before the patch 
e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() got landed, its a 
classic example of how this issue can happen with userspace involvement 
and not just fbdev client which was your case.

It seems the hotplug state machine needs to be reworked completely, but
at least we're roughly back where we were with 6.7 (including that the
bus clocks will never be turned of because of the rpm leaks on
disconnect).

Yes, we already landed that revert but I am also planning to land the 
patch I had posted till we rework HPD.

Johan