Re: [PATCH] drm/atomic: Perform blocking commits on workqueue

On Fri, Oct 13, 2023 at 10:04:02AM -0400, Ray Strode wrote:
> Hi
> 
> On Fri, Oct 13, 2023 at 5:41 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > > I mean we're not talking about scientific computing, or code
> > > compilation, or seti@home. We're talking about nearly the equivalent
> > > of `while (1) __asm__ ("nop");`
> >
> > I don't think anyone said this shouldn't be fixed or improved.
> 
> Yeah, but we apparently disagree that it would be an improvement to stop
> discrediting user space for driver problems.
> You see, to me, there are two problems: 1) the behavior itself is not
> nice and should be fixed; 2) the blame/accounting/attribution for a
> driver problem is getting directed at user space. I think both issues
> should be fixed. One brought the other issue to light, but both
> problems, in my mind, are legitimate and should be addressed. You
> think fixing the second problem is some tactic to paper over the first
> problem. Christian thinks the second problem isn't a problem at all
> but the correct design. So none of us are completely aligned, and I
> don't see anyone changing their mind anytime soon.
> 
> > What I'm saying is that the atomic ioctl is not going to make guarantees
> > that it will not take up too much cpu time (for some extremely vague value
> > of "too much") to the point that userspace can configure its compositor
> > in a way that it _will_ get killed if we _ever_ violate this rule.
> Yeah, I find that strange; all kernel tasks have a certain implied
> baseline responsibility to be good citizens to the system.
> And how user space is configured seems nearly immaterial.
> 
> But we're talking in circles.
> 
> > Fixing the worst offenders I don't think will see any pushback at all.
> Yeah, we all agree this one busy loop is a problem that needs fixing,
> but we don't agree on where the cpu time blame should go.
> 
> > > But *this* feels like duct tape: You've already said there's no
> > > guarantee the problem won't also happen during preliminary computation
> > > during non-blocking commits or via other drm entry points. So it
> > > really does seem like a fix that won't age well. I won't be surprised
> > > if in ~3 years (or whatever) in some RHEL release there's a customer
> > > bug leading to the real-time thread getting blocklisted for obscure
> > > server display hardware because it's causing the session to tank on a
> > > production machine.
> >
> > Yes, the atomic ioctl makes no guarantee, across drivers and hw
> > platforms, of "we will never violate this, you can kill your compositor
> > if we do" cpu bound limits. We'll try not to suck too badly, and we'll
> > try to focus on fixing the suck where it upsets people the most.
> >
> > But there is fundamentally no hard real time guarantee in atomic. At least
> > not with the current uapi.
> 
> So in your mind mutter should get out of the real-time thread business entirely?

Yes. At least out of the hard real-time "the kernel kills your process if
you fail" business. Desktop drawing just isn't hard real-time: nothing
disastrous happens if we miss a deadline.

Of course special setups where everything is very controlled might be
different.

> > The problem isn't about wasting cpu time. It's about you having set up the
> > system in a way so that mutter gets killed if we ever waste too much cpu
> > time, ever.
> 
> mutter is not set up in a way to kill itself if the driver wastes too
> much cpu time, ever. mutter is set up in a way to kill itself if it,
> itself, wastes too much cpu time, ever. The fact that the kernel is
> killing it when a kernel driver is wasting cpu time is the bug that
> led to the patch in the first place!
> 
> > The latter is flat out not a guarantee the kernel currently makes, and I
> > see no practical, feasible way to make that happen. And pretending we do
> > make this guarantee will just result in frustrated users filing bugs that
> > their desktop session died and developers closing them as "too hard to
> > fix".
> 
> I think all three of us agree busy loops are not nice (though maybe we
> aren't completely aligned on severity). And I'll certainly concede
> that fixing all the busy loops can be hard. Some of the cpu-bound code
> paths may even be doing legitimate computation. I still assert that
> if the uapi calls into driver code that might be doing something slow
> where it can't sleep, that work should happen on a workqueue or
> thread. That seems orthogonal to fixing all the places where the
> drivers aren't acting nicely.

Again no, because underneath it all your requirement boils down to hard
real time.

And we just cannot guarantee that with atomic kms. Best effort,
absolutely. Guaranteed to never fail, no way.
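
For anyone following along, the mechanism the patch title refers to is
the standard deferred-work pattern. A minimal sketch with made-up names
(this is not the actual patch):

#include <linux/workqueue.h>

struct commit_work {
	struct work_struct work;
	/* ...state describing the pending commit would go here... */
};

static void commit_work_fn(struct work_struct *work)
{
	struct commit_work *cw = container_of(work, struct commit_work, work);

	/* Slow or busy-looping driver code runs here, in kworker
	 * context, so its cpu time gets charged to the kernel worker
	 * rather than to the task that issued the ioctl. */
	(void)cw;
}

static int do_blocking_commit(void)
{
	struct commit_work cw;

	INIT_WORK_ONSTACK(&cw.work, commit_work_fn);
	queue_work(system_unbound_wq, &cw.work);
	/* A blocking commit still waits for the result before returning
	 * to userspace; only the execution context moves. */
	flush_work(&cw.work);
	destroy_work_on_stack(&cw.work);
	return 0;
}

Note that this only changes where the cpu time gets accounted, which is
precisely the attribution question we keep going in circles on.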

> > Maybe as a bit more useful outcome of this entire discussion: Could you
> > sign up to write a documentation patch for the atomic ioctl that adds a
> > paragraph stating extremely clearly that this ioctl does not make hard
> > real-time guarantees, only best-effort soft real-time, and even then the
> > effort is focused entirely on the !ALLOW_MODESET case, because that is
> > the one users care about the most?
> 
> I don't think I'm best positioned to write such documentation. I
> still think what the kernel is doing here is wrong, and I don't even
> fully grok what you mean by "best effort".

It's the difference between hard real time and soft real time. atomic kms
is a soft real time system, not hard real time.

You've set up mutter in a hard real time way, and that just doesn't work.
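
Concretely, that setup is (presumably) a real-time scheduling policy
plus RLIMIT_RTTIME: the kernel sends SIGXCPU at the soft limit and
SIGKILL at the hard limit once the thread burns through its cpu budget
without blocking. A minimal userspace sketch, with a made-up budget
(mutter's actual code will differ):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct sched_param param = { .sched_priority = 1 };
	/* Made-up budget: 200 ms of cpu time without blocking. */
	struct rlimit rl = { .rlim_cur = 200000, .rlim_max = 200000 };

	/* Real-time scheduling; needs CAP_SYS_NICE or an RLIMIT_RTPRIO
	 * allowance. */
	if (sched_setscheduler(0, SCHED_RR, &param))
		perror("sched_setscheduler");

	/* Any cpu time burned on this thread without a blocking call
	 * counts against the budget -- including time spent inside a
	 * blocking ioctl, which is the whole dispute. */
	if (setrlimit(RLIMIT_RTTIME, &rl))
		perror("setrlimit");

	/* ...compositor main loop would run here... */
	return 0;
}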

I guess I can type up some patch when I'm back from XDC, but if the
difference between soft and hard real time isn't clear, then yeah, you
don't understand the point I've been trying to make in this thread.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


