Re: [PATCH 05/18] drm/amd/display: Use mdelay to avoid crashes

Harry Wentland <harry.wentland@xxxxxxx> · Thu, 15 Dec 2022 10:17:37 -0500

On 12/15/22 05:29, Michel Dänzer wrote:
> On 12/15/22 09:09, Christian König wrote:
>> Am 15.12.22 um 00:33 schrieb Alex Hung:
>>> On 2022-12-14 16:06, Alex Deucher wrote:
>>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung@xxxxxxx> wrote:
>>>>> On 2022-12-14 15:35, Alex Deucher wrote:
>>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung@xxxxxxx> wrote:
>>>>>>> On 2022-12-14 14:54, Alex Deucher wrote:
>>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung@xxxxxxx> wrote:
>>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote:
>>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
>>>>>>>>>> <aurabindo.pillai@xxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> From: Alex Hung <alex.hung@xxxxxxx>
>>>>>>>>>>>
>>>>>>>>>>> [Why]
>>>>>>>>>>> When running IGT kms_bw test with DP monitor, some systems crash from
>>>>>>>>>>> msleep no matter how long or short the time is.
>>>>>>>>>>>
>>>>>>>>>>> [How]
>>>>>>>>>>> To replace msleep with mdelay.
>>>>>>>>>>
>>>>>>>>>> Can you provide a bit more info about the crash?  A lot of platforms
>>>>>>>>>> don't support delay larger than 2-4ms so this change will generate
>>>>>>>>>> errors on ARM and possibly other platforms.
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> The msleep was introduced in eec3303de3378 for non-compliant display
>>>>>>>>> port monitors but IGT's kms_bw test can cause a recent h/w to hang at
>>>>>>>>> msleep(60) when calling "igt_remove_fb" in IGT
>>>>>>>>> (https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197>>>>>>>>>>
>>>>>>>>> It is possible to workaround this by reversing order of
>>>>>>>>> igt_remove_fb(&buffer[i]), as the following example:
>>>>>>>>>
>>>>>>>>>       igt_create_color_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>>
>>>>>>>>> Hangs:
>>>>>>>>>       igt_remove_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>>
>>>>>>>>> No hangs:
>>>>>>>>>       igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0]
>>>>>>>>>
>>>>>>>>> However, IGT simply exposes the problem and it makes more sense to stop
>>>>>>>>> the hang from occurring.
>>>>>>>>>
>>>>>>>>> I also tried to remove the msleep completely and it also work, but I
>>>>>>>>> didn't want to break the fix for the original problematic hardware
>>>>>>>>> configuration.
>>>>>>>>
>>>>>>>> Why does sleep vs delay make a difference?  Is there some race that we
>>>>>>>> are not locking against?
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>
>>>>>>> That was my original thought but I did not find any previously. I will
>>>>>>> investigate it again.
>>>>>>>
>>>>>>> If mdelay(>4) isn't usable on other platforms, is it an option to use
>>>>>>> mdelay on x86_64 only and keep msleep on other platforms or just remove
>>>>>>> the msleep for other platforms, something like
>>>>>>>
>>>>>>> -                       msleep(60);
>>>>>>> +#ifdef CONFIG_X86_64
>>>>>>> +                       mdelay(60);
>>>>>>> +#endif
>>>>>>
>>>>>> That's pretty ugly.  I'd rather try and resolve the root cause.  How
>>>>>> important is the IGT test?  What does it do?  Is the test itself
>>>>>> correct?
>>>>>>
>>>>>
>>>>> Agreed, and I didn't want to add conditions around the mdelay for the
>>>>> same reason. I will assume this is not an option now.
>>>>>
>>>>> As in the previous comment, IGT can be modified to avoid the crash by
>>>>> reversing the order fb is removed - though I suspect I will receive
>>>>> questions why this is not fixed in kernel.
>>>>>
>>>>> I wanted to fix this in kernel because nothing stops other user-space
>>>>> applications to use the same way to crash kernel, so fixing IGT is the
>>>>> second option.
>>>>>
>>>>> Apparently causing problems on other platforms isn't an option at all so
>>>>> I will try to figure out an non-mdelay solution, and then maybe an IGT
>>>>> solution instead.
>>>>
>>>> What hangs?  The test or the kernel or the hardware?
>>>
>>> The system becomes completely unresponsive - no keyboard, mouse nor remote accesses.
>>
>> I agree with Alex that changing this is extremely questionable and not justified at all.
>>
>> My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else.
>>
>> This clearly points to missing locking or similar to protect concurrent execution of things.
> Might another possibility be that this code gets called from an atomic context which can't sleep?
> 
> 

It can come through handle_hpd_rx_irq but we're using a workqueue
to queue interrupt handling so this shouldn't come from an atomic
context. I currently don't see where else it might be used in an
atomic context. Alex Hung, can you do a dump_stack() in this function
to see where the problematic call is coming from?

Fixing IGT will only mask the issue. Userspace should never be able
to put the system in a state where it stops responding entirely. This
will need some sort of fix in the kernel.

Harry