Re: [PATCH 05/18] drm/amd/display: Use mdelay to avoid crashes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 12/15/22 10:17, Harry Wentland wrote:

On 12/15/22 05:29, Michel Dänzer wrote:
On 12/15/22 09:09, Christian König wrote:
Am 15.12.22 um 00:33 schrieb Alex Hung:
On 2022-12-14 16:06, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung@xxxxxxx> wrote:
On 2022-12-14 15:35, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung@xxxxxxx> wrote:
On 2022-12-14 14:54, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung@xxxxxxx> wrote:
On 2022-12-14 13:48, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
<aurabindo.pillai@xxxxxxx> wrote:
From: Alex Hung <alex.hung@xxxxxxx>

[Why]
When running IGT kms_bw test with DP monitor, some systems crash from
msleep no matter how long or short the time is.

[How]
To replace msleep with mdelay.
Can you provide a bit more info about the crash?  A lot of platforms
don't support delay larger than 2-4ms so this change will generate
errors on ARM and possibly other platforms.

Alex
The msleep was introduced in eec3303de3378 for non-compliant display
port monitors but IGT's kms_bw test can cause a recent h/w to hang at
msleep(60) when calling "igt_remove_fb" in IGT
(https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197>>>>>>>>>>
It is possible to workaround this by reversing order of
igt_remove_fb(&buffer[i]), as the following example:

      igt_create_color_fb with the order buffer[0], buffer[1], buffer[2]

Hangs:
      igt_remove_fb with the order buffer[0], buffer[1], buffer[2]

No hangs:
      igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0]

However, IGT simply exposes the problem and it makes more sense to stop
the hang from occurring.

I also tried to remove the msleep completely and it also work, but I
didn't want to break the fix for the original problematic hardware
configuration.
Why does sleep vs delay make a difference?  Is there some race that we
are not locking against?

Alex

That was my original thought but I did not find any previously. I will
investigate it again.

If mdelay(>4) isn't usable on other platforms, is it an option to use
mdelay on x86_64 only and keep msleep on other platforms or just remove
the msleep for other platforms, something like

-                       msleep(60);
+#ifdef CONFIG_X86_64
+                       mdelay(60);
+#endif
That's pretty ugly.  I'd rather try and resolve the root cause.  How
important is the IGT test?  What does it do?  Is the test itself
correct?

Agreed, and I didn't want to add conditions around the mdelay for the
same reason. I will assume this is not an option now.

As in the previous comment, IGT can be modified to avoid the crash by
reversing the order fb is removed - though I suspect I will receive
questions why this is not fixed in kernel.

I wanted to fix this in kernel because nothing stops other user-space
applications to use the same way to crash kernel, so fixing IGT is the
second option.

Apparently causing problems on other platforms isn't an option at all so
I will try to figure out an non-mdelay solution, and then maybe an IGT
solution instead.
What hangs?  The test or the kernel or the hardware?
The system becomes completely unresponsive - no keyboard, mouse nor remote accesses.
I agree with Alex that changing this is extremely questionable and not justified at all.

My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else.

This clearly points to missing locking or similar to protect concurrent execution of things.
Might another possibility be that this code gets called from an atomic context which can't sleep?


It can come through handle_hpd_rx_irq but we're using a workqueue
to queue interrupt handling so this shouldn't come from an atomic
context. I currently don't see where else it might be used in an
atomic context. Alex Hung, can you do a dump_stack() in this function
to see where the problematic call is coming from?

Fixing IGT will only mask the issue. Userspace should never be able
to put the system in a state where it stops responding entirely. This
will need some sort of fix in the kernel.

Harry


I will drop this patch from the series.

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux