On 12/15/22 09:09, Christian König wrote: > Am 15.12.22 um 00:33 schrieb Alex Hung: >> On 2022-12-14 16:06, Alex Deucher wrote: >>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung@xxxxxxx> wrote: >>>> On 2022-12-14 15:35, Alex Deucher wrote: >>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung@xxxxxxx> wrote: >>>>>> On 2022-12-14 14:54, Alex Deucher wrote: >>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung@xxxxxxx> wrote: >>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote: >>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai >>>>>>>>> <aurabindo.pillai@xxxxxxx> wrote: >>>>>>>>>> >>>>>>>>>> From: Alex Hung <alex.hung@xxxxxxx> >>>>>>>>>> >>>>>>>>>> [Why] >>>>>>>>>> When running IGT kms_bw test with DP monitor, some systems crash from >>>>>>>>>> msleep no matter how long or short the time is. >>>>>>>>>> >>>>>>>>>> [How] >>>>>>>>>> To replace msleep with mdelay. >>>>>>>>> >>>>>>>>> Can you provide a bit more info about the crash? A lot of platforms >>>>>>>>> don't support delay larger than 2-4ms so this change will generate >>>>>>>>> errors on ARM and possibly other platforms. >>>>>>>>> >>>>>>>>> Alex >>>>>>>> >>>>>>>> The msleep was introduced in eec3303de3378 for non-compliant display >>>>>>>> port monitors but IGT's kms_bw test can cause a recent h/w to hang at >>>>>>>> msleep(60) when calling "igt_remove_fb" in IGT >>>>>>>> (https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Figt-gpu-tools%2F-%2Fblob%2Fmaster%2Ftests%2Fkms_bw.c%23L197&data=05%7C01%7Calex.hung%40amd.com%7C81664450189542a7bbc408dade27d0e9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638066559844526853%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=M%2BF4H2qddXItPoUZRVyhlc5N8UF1%2FHIh8PpfT%2BCmdZ4%3D&reserved=0) >>>>>>>> >>>>>>>> It is possible to workaround this by reversing order of >>>>>>>> igt_remove_fb(&buffer[i]), as the following example: >>>>>>>> >>>>>>>> igt_create_color_fb with the order buffer[0], buffer[1], buffer[2] >>>>>>>> >>>>>>>> Hangs: >>>>>>>> igt_remove_fb with the order buffer[0], buffer[1], buffer[2] >>>>>>>> >>>>>>>> No hangs: >>>>>>>> igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0] >>>>>>>> >>>>>>>> However, IGT simply exposes the problem and it makes more sense to stop >>>>>>>> the hang from occurring. >>>>>>>> >>>>>>>> I also tried to remove the msleep completely and it also work, but I >>>>>>>> didn't want to break the fix for the original problematic hardware >>>>>>>> configuration. >>>>>>> >>>>>>> Why does sleep vs delay make a difference? Is there some race that we >>>>>>> are not locking against? >>>>>>> >>>>>>> Alex >>>>>>> >>>>>> >>>>>> That was my original thought but I did not find any previously. I will >>>>>> investigate it again. >>>>>> >>>>>> If mdelay(>4) isn't usable on other platforms, is it an option to use >>>>>> mdelay on x86_64 only and keep msleep on other platforms or just remove >>>>>> the msleep for other platforms, something like >>>>>> >>>>>> - msleep(60); >>>>>> +#ifdef CONFIG_X86_64 >>>>>> + mdelay(60); >>>>>> +#endif >>>>> >>>>> That's pretty ugly. I'd rather try and resolve the root cause. How >>>>> important is the IGT test? What does it do? Is the test itself >>>>> correct? >>>>> >>>> >>>> Agreed, and I didn't want to add conditions around the mdelay for the >>>> same reason. I will assume this is not an option now. >>>> >>>> As in the previous comment, IGT can be modified to avoid the crash by >>>> reversing the order fb is removed - though I suspect I will receive >>>> questions why this is not fixed in kernel. >>>> >>>> I wanted to fix this in kernel because nothing stops other user-space >>>> applications to use the same way to crash kernel, so fixing IGT is the >>>> second option. >>>> >>>> Apparently causing problems on other platforms isn't an option at all so >>>> I will try to figure out an non-mdelay solution, and then maybe an IGT >>>> solution instead. >>> >>> What hangs? The test or the kernel or the hardware? >> >> The system becomes completely unresponsive - no keyboard, mouse nor remote accesses. > > I agree with Alex that changing this is extremely questionable and not justified at all. > > My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else. > > This clearly points to missing locking or similar to protect concurrent execution of things. Might another possibility be that this code gets called from an atomic context which can't sleep? -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and Xwayland developer