Re: [PATCH] drm/atomic: Perform blocking commits on workqueue

Ray Strode <halfline@xxxxxxxxx> · Thu, 12 Oct 2023 14:19:41 -0400

Hi,

On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote:
> > > > To be clear, my take is, if driver code is running in process context
> > > > and needs to wait for periods of time on the order of or in excess of
> > > > a typical process time slice it should be sleeping during the waiting.
> > > > If the operation is at a point where it can be cancelled without side
> > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point
> > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my
> > > > opinion, should kernel code busy block a typical process for dozens of
> > > > milliseconds while keeping the process RUNNING. I don't think this is
> > > > a controversial take.
> > > Exactly that's what I completely disagree on.

Okay if we can't agree that it's not okay for user space (or the
kernel running in the context of user space) to busy loop a cpu core
at 100% utilization throughout and beyond the process's entire
scheduled time slice then we really are at an impasse. I gotta say i'm
astonished that this seemingly indefensible behavior is somehow a
point of contention, but I'm not going to keep arguing about it beyond
this email.

I mean we're not talking about scientific computing, or code
compilation, or seti@home. We're talking about nearly the equivalent
of `while (1) __asm__ ("nop");`

> > The key point here is that the patch puts the work into the background just
> > to avoid that it is accounted to the thread issuing it, and that in turn is
> > not valid as far as I can see.
>
> Yeah it's that aspect I'm really worried about, because we essentially
> start to support some gurantees that a) most drivers can't uphold without
> a huge amount of work, some of the DC state recomputations are _really_
> expensive b) without actually making the semantics clear, it's just
> duct-tape.

If DC plane state computation (or whatever) is really taking 50ms or
200ms, then it probably should be done in chunks so it doesn't get
preempted at an inopportune point? Look, this is not my wheelhouse,
this is your wheelhouse, and I don't want to keep debating forever. It
seems there is a discrepancy between our understandings of implied
acceptable behavior.

> Yes compositors want to run kms in real-time, and yes that results in fun
> if you try to strictly account for cpu time spent. Especially if your
> policy is to just nuke the real time thread instead of demoting it to
> SCHED_NORMAL for a time.

So I ended up going with this suggestion for blocking modesets:

https://gitlab.gnome.org/GNOME/mutter/-/commit/5d3e31a49968fc0da04e98c0f9d624ea5095c9e0

But *this* feels like duct tape: You've already said there's no
guarantee the problem won't also happen during preliminary computation
during non-blocking commits or via other drm entry points. So it
really does seem like a fix that won't age well. I won't be surprised
if in ~3 years (or whatever) in some RHEL release there's a customer
bug leading to the real-time thread getting blocklisted for obscure
server display hardware because it's causing the session to tank on a
production machine.

> I think if we want more than hacks here we need to answer two questions:
> - which parts of the kms api are real time
> - what exactly do we guarantee with that

imo, this isn't just about real-time versus non-real-time. It's no
more acceptable for non-real-time mutter to be using 100% CPU doing
busywaits than it is for real-time mutter to be using 100% cpu doing
busywaits.

Also, both you and Christian have suggested using the non-blocking
modeset api with a fence fd to poll on is equivalent to the blocking
api flushing the commit_tail work before returning from the ioctl, but
that's not actually true. I think we all now agree the EBUSY problem
you mentioned as an issue with my proposed patch wasn't actually a
problem for blocking commits, but that very same issue is a problem
with the non-blocking commits that then block on a fence fd, right? I
guess we'd need to block on a fence fd from the prior non-blocking
commit first before starting the blocking commit (or something)

--Ray