On 12/15/22 05:29, Michel Dänzer wrote: > On 12/15/22 09:09, Christian König wrote: >> Am 15.12.22 um 00:33 schrieb Alex Hung: >>> On 2022-12-14 16:06, Alex Deucher wrote: >>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung@xxxxxxx> wrote: >>>>> On 2022-12-14 15:35, Alex Deucher wrote: >>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung@xxxxxxx> wrote: >>>>>>> On 2022-12-14 14:54, Alex Deucher wrote: >>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung@xxxxxxx> wrote: >>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote: >>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai >>>>>>>>>> <aurabindo.pillai@xxxxxxx> wrote: >>>>>>>>>>> >>>>>>>>>>> From: Alex Hung <alex.hung@xxxxxxx> >>>>>>>>>>> >>>>>>>>>>> [Why] >>>>>>>>>>> When running IGT kms_bw test with DP monitor, some systems crash from >>>>>>>>>>> msleep no matter how long or short the time is. >>>>>>>>>>> >>>>>>>>>>> [How] >>>>>>>>>>> To replace msleep with mdelay. >>>>>>>>>> >>>>>>>>>> Can you provide a bit more info about the crash? A lot of platforms >>>>>>>>>> don't support delay larger than 2-4ms so this change will generate >>>>>>>>>> errors on ARM and possibly other platforms. >>>>>>>>>> >>>>>>>>>> Alex >>>>>>>>> >>>>>>>>> The msleep was introduced in eec3303de3378 for non-compliant display >>>>>>>>> port monitors but IGT's kms_bw test can cause a recent h/w to hang at >>>>>>>>> msleep(60) when calling "igt_remove_fb" in IGT >>>>>>>>> (https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197>>>>>>>>>> >>>>>>>>> It is possible to workaround this by reversing order of >>>>>>>>> igt_remove_fb(&buffer[i]), as the following example: >>>>>>>>> >>>>>>>>> igt_create_color_fb with the order buffer[0], buffer[1], buffer[2] >>>>>>>>> >>>>>>>>> Hangs: >>>>>>>>> igt_remove_fb with the order buffer[0], buffer[1], buffer[2] >>>>>>>>> >>>>>>>>> No hangs: >>>>>>>>> igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0] >>>>>>>>> >>>>>>>>> However, IGT simply exposes the problem and it makes more sense to stop >>>>>>>>> the hang from occurring. >>>>>>>>> >>>>>>>>> I also tried to remove the msleep completely and it also work, but I >>>>>>>>> didn't want to break the fix for the original problematic hardware >>>>>>>>> configuration. >>>>>>>> >>>>>>>> Why does sleep vs delay make a difference? Is there some race that we >>>>>>>> are not locking against? >>>>>>>> >>>>>>>> Alex >>>>>>>> >>>>>>> >>>>>>> That was my original thought but I did not find any previously. I will >>>>>>> investigate it again. >>>>>>> >>>>>>> If mdelay(>4) isn't usable on other platforms, is it an option to use >>>>>>> mdelay on x86_64 only and keep msleep on other platforms or just remove >>>>>>> the msleep for other platforms, something like >>>>>>> >>>>>>> - msleep(60); >>>>>>> +#ifdef CONFIG_X86_64 >>>>>>> + mdelay(60); >>>>>>> +#endif >>>>>> >>>>>> That's pretty ugly. I'd rather try and resolve the root cause. How >>>>>> important is the IGT test? What does it do? Is the test itself >>>>>> correct? >>>>>> >>>>> >>>>> Agreed, and I didn't want to add conditions around the mdelay for the >>>>> same reason. I will assume this is not an option now. >>>>> >>>>> As in the previous comment, IGT can be modified to avoid the crash by >>>>> reversing the order fb is removed - though I suspect I will receive >>>>> questions why this is not fixed in kernel. >>>>> >>>>> I wanted to fix this in kernel because nothing stops other user-space >>>>> applications to use the same way to crash kernel, so fixing IGT is the >>>>> second option. >>>>> >>>>> Apparently causing problems on other platforms isn't an option at all so >>>>> I will try to figure out an non-mdelay solution, and then maybe an IGT >>>>> solution instead. >>>> >>>> What hangs? The test or the kernel or the hardware? >>> >>> The system becomes completely unresponsive - no keyboard, mouse nor remote accesses. >> >> I agree with Alex that changing this is extremely questionable and not justified at all. >> >> My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else. >> >> This clearly points to missing locking or similar to protect concurrent execution of things. > Might another possibility be that this code gets called from an atomic context which can't sleep? > > It can come through handle_hpd_rx_irq but we're using a workqueue to queue interrupt handling so this shouldn't come from an atomic context. I currently don't see where else it might be used in an atomic context. Alex Hung, can you do a dump_stack() in this function to see where the problematic call is coming from? Fixing IGT will only mask the issue. Userspace should never be able to put the system in a state where it stops responding entirely. This will need some sort of fix in the kernel. Harry