Re: [PATCH] drm/atomic: Perform blocking commits on workqueue

Ray Strode <halfline@xxxxxxxxx> · Thu, 28 Sep 2023 10:20:29 -0400

Hi,

On Thu, Sep 28, 2023 at 9:24 AM Christian König
<christian.koenig@xxxxxxx> wrote:
> If you see a large delay in the dpms off case then we probably have a driver bug somewhere.

This is something we both agree on, I think.

>> I'm getting the idea that you think there is some big bucket of kernel
>> syscalls that block for a large fraction of a second by design and are
>> not meant to reset RLIMIT_RTTIME.
>
> Yes, exactly that's the case as far as I know.

Okay, well i'm willing to be proven wrong. My previously stated list
is: futex, sched_yield, and maybe a few obscure ioctls. If you have a
big bucket of syscalls in your mind, i'm all ears. I think I'm
probably right and the lions share of all syscalls that can block for
a long time don't leave the process RUNNABLE, though.

> The purpose of RLIMIT_RTTIME is to have a watchdog if an application is still responsive. Only things which make the application look for outside events are considered a reset for this whatdog.
>
> So for example calling select() and poll() will reset the watchdog, but not calling any DRM modeset functions or an power management IOCTL for example.

what power management ioctls are you talking about?

>
> There are only two possibilities I can see how a DRM modeset is triggering this:
>
> 1. We create enough CPU overhead to actually exceed the limit. In this case triggering it is certainly intentional.

As I said before I suspect that not all modeset work for all drivers
is happening in the process context already anyway. So if this is
intentional, I suspect it's not actually working as designed (but I
haven't confirmed that, aside from my `git grep -E
'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc
-l` litmus test) . I can't come up with a good reason it would be
designed like that, however, and you haven't provided one, so I
suspect it's not actually intentional.

>> I don't think you really thought through what you're saying here. It
>> just flatly doesn't apply for drmModeAtomicCommit. What is an
>> application supposed to do?
>
> Get terminated and restart. That's what the whole functionality is good for.
>
> If you don't desire that than don't use the RLIMIT_RTTIME functionality.

No, getting terminated and restarting the session because
drmModeAtomicCommit is taking too long is not really acceptable
behavior for a display server. In general, crashing the session is
maybe better than locking up the entire system hard, so RLIMIT_RTTIME
is still a good idea, but in either case we're in bug land here, each
such bug needs to be fixed. In this case, the bug is
drmModeAtomicCommit isn't sleeping like nearly every other syscall
does.

> From the RLIMIT_RTTIME documentation: "The intended use of this limit
> is to stop a runaway real-time process from locking up the system."
>
> And when drmAtomicCommit() is triggering this then we either have a
> problem with the application doing something it is not supposed to do
> (like blocking for vblank while it should listen to network traffic) or
> the driver is somehow buggy.
Yes! This is about bugs.

> drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit
> runs away in e.g. a set of 100ms busy loops responding to a confused or
> slow responding GPU, the system will seemingly lock up to the user. That
> is an intractable problem that we can not get away from. It doesn't
> matter if the kernel is stuck in process context or on a workqueue. And,
> regardless, it's not reasonable to expect userspace to craft elaborate
> workarounds for driver bugs. We just have to fix the bugs.
>
> Exactly that's the reason why I'm pointing out that this isn't a good idea.

Not following your logic here.

> That just papers over the problem. The process doesn't become more responsive by hiding the problem.

It's not about the process being responsive. In meat-space, the
sub-second delay is imperceptible because the screen is turned off.

It's about not killing an innocent display server because of a driver bug.

> What you need to do here is to report those problems to the driver teams and not try to hide them this way.

There is already a bug, it's mentioned in the merge request, the
reason I cc'd you (vs say just Dave and Daniel) is because the driver
bug is in amdgpu code.

--Ray