Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> · Wed, 13 Mar 2024 10:24:08 -0700

On 3/13/2024 1:18 AM, Johan Hovold wrote:
On Tue, Mar 12, 2024 at 10:39:46AM -0700, Abhinav Kumar wrote:
On 3/12/2024 9:59 AM, Johan Hovold wrote:

Heh. This is getting ridiculous. I just tried running with this patch
and it again breaks hotplug detect in a VT console and in X (where I
could enable a reconnected external display by running xrandr twice
before).

So, please, do not apply this one.

To make things worse, I indeed also hit the reset when disconnecting
after such a failed hotplug.

Ack, I will hold off till I analyze your issues more which you have
listed in separate replies. Especially about the spurious connect, I
believe you are trying to mention that, by adding logs, you are able to
delay the processing of a connect event to *make* it like a spurious
one? In case, I got this part wrong, can you pls explain the spurious
connect scenario again?

No, I only mentioned the debug printks in passing as instrumentation
like that may affect race conditions (but I'm also hitting the resets
also with no printks in place).

The spurious connect event comes directly from the pmic firmware, and
even if we may optimise things by implementing some kind of debounce,
the hotplug implementation needs to be robust enough to not kill the
machine if such an event gets through.

Basically what I see is that during physical disconnect there can be
multiple hpd notify events (e.g. connect, disconnect, connect):

[  146.910195] usb 5-1: USB disconnect, device number 4
[  146.931026] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
[  146.934785] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[  146.938114] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
[  146.940245] [CONNECTOR:35:DP-2] status updated from disconnected to connected
[  146.955193] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2

And it is the spurious connect event while the link is being tore down
that triggers the hotplug processing that leads to the reset.

Similarly, I've seen spurious disconnect events while the plug in being
inserted.

This is quite weird and also explains why most of the issues were seen 
only with sc8280xp. pmic spurious events are busting the hpd logic.

Agreed, that DP driver should be robust enough to handle this but this 
will also bust usermode to send down unnecessary frames. Someone should 
address why these spurious events are coming.

A short response on why this change was made is that commit can be
issued by userspace or the fbdev client. So userspace involvement only
makes commit happen from a different path. It would be incorrect to
assume the issues from the earlier bug and the current one are different
only because there was userspace involvement in that one and not this.

Because in the end, it manifests itself in the same way that
atomic_enable() did not go through after an atomic_disable() and the
next atomic_disable() crashes.

Right, but your proposed fix would not actually fix anything and judging
from the sparse commit message and diff itself it is clearly only meant
to mitigate the case where user space is involved, which is *not* the
case here.

No, I think there is some disconnect in the way you are reading that 
patch perhaps due to some missing details OR I am missing your point.

Like I said, drm_atomic_commit() can be issued by userspace or the fbdev 
client in the driver. Thats the only userspace involvement.

Now, why the patch was made or was expected to work.

There can be a race condition between the time the DP driver gets the 
hpd disconnect event and when the hpd thread processes that event 
allowing the commit to sneak in. This is something which has always been 
there even without pm_runtime series and remains even today.

In this race condition, the setting of "link_ready" to false can be a 
bit delayed if we go through the HPD event processing increasing the 
race condition window.

If link_ready is false, atomic_check() fails, thereby failing any 
commits and hence not allowing the atomic_disable() / atomic_enable() 
cycle and hence avoiding this reset.

The patch is moving the setting of link_ready to false earlier by not 
putting it through the HPD event thread and hence trying to reduce the 
window of the issue.

Johan