On 12/15/22
10:17, Harry Wentland wrote:
I will drop this patch from the series.On 12/15/22 05:29, Michel Dänzer wrote:On 12/15/22 09:09, Christian König wrote:Am 15.12.22 um 00:33 schrieb Alex Hung:On 2022-12-14 16:06, Alex Deucher wrote:On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung@xxxxxxx> wrote:On 2022-12-14 15:35, Alex Deucher wrote:On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung@xxxxxxx> wrote:On 2022-12-14 14:54, Alex Deucher wrote:On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung@xxxxxxx> wrote:On 2022-12-14 13:48, Alex Deucher wrote:On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai <aurabindo.pillai@xxxxxxx> wrote:From: Alex Hung <alex.hung@xxxxxxx> [Why] When running IGT kms_bw test with DP monitor, some systems crash from msleep no matter how long or short the time is. [How] To replace msleep with mdelay.Can you provide a bit more info about the crash? A lot of platforms don't support delay larger than 2-4ms so this change will generate errors on ARM and possibly other platforms. AlexThe msleep was introduced in eec3303de3378 for non-compliant display port monitors but IGT's kms_bw test can cause a recent h/w to hang at msleep(60) when calling "igt_remove_fb" in IGT (https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197>>>>>>>>>> It is possible to workaround this by reversing order of igt_remove_fb(&buffer[i]), as the following example: igt_create_color_fb with the order buffer[0], buffer[1], buffer[2] Hangs: igt_remove_fb with the order buffer[0], buffer[1], buffer[2] No hangs: igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0] However, IGT simply exposes the problem and it makes more sense to stop the hang from occurring. I also tried to remove the msleep completely and it also work, but I didn't want to break the fix for the original problematic hardware configuration.Why does sleep vs delay make a difference? Is there some race that we are not locking against? AlexThat was my original thought but I did not find any previously. I will investigate it again. If mdelay(>4) isn't usable on other platforms, is it an option to use mdelay on x86_64 only and keep msleep on other platforms or just remove the msleep for other platforms, something like - msleep(60); +#ifdef CONFIG_X86_64 + mdelay(60); +#endifThat's pretty ugly. I'd rather try and resolve the root cause. How important is the IGT test? What does it do? Is the test itself correct?Agreed, and I didn't want to add conditions around the mdelay for the same reason. I will assume this is not an option now. As in the previous comment, IGT can be modified to avoid the crash by reversing the order fb is removed - though I suspect I will receive questions why this is not fixed in kernel. I wanted to fix this in kernel because nothing stops other user-space applications to use the same way to crash kernel, so fixing IGT is the second option. Apparently causing problems on other platforms isn't an option at all so I will try to figure out an non-mdelay solution, and then maybe an IGT solution instead.What hangs? The test or the kernel or the hardware?The system becomes completely unresponsive - no keyboard, mouse nor remote accesses.I agree with Alex that changing this is extremely questionable and not justified at all. My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else. This clearly points to missing locking or similar to protect concurrent execution of things.Might another possibility be that this code gets called from an atomic context which can't sleep?It can come through handle_hpd_rx_irq but we're using a workqueue to queue interrupt handling so this shouldn't come from an atomic context. I currently don't see where else it might be used in an atomic context. Alex Hung, can you do a dump_stack() in this function to see where the problematic call is coming from? Fixing IGT will only mask the issue. Userspace should never be able to put the system in a state where it stops responding entirely. This will need some sort of fix in the kernel. Harry |