Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

Christian König <christian.koenig@xxxxxxx> · Tue, 7 Nov 2023 10:13:37 +0100

Am 06.11.23 um 17:46 schrieb Danilo Krummrich:
On Fri, Nov 03, 2023 at 09:15:25AM +0100, Christian König wrote:
Am 02.11.23 um 19:03 schrieb Danilo Krummrich:
> On Thu, Nov 02, 2023 at 11:07:32AM +0100, Christian König wrote:
> > Hi Danilo,
> > > > Am 31.10.23 um 16:01 schrieb Danilo Krummrich:
> > > On Tue, Oct 31, 2023 at 02:20:50PM +0100, Christian König wrote:
> > > > [SNIP]
> > > > Yeah, I see the intention here and can perfectly relate to it 
it's just that
> > > > I have prioritize other things.
> > > I don't see any work being required from your side for this.
> > What I wanted to say is that I understand your intentions and can 
relate to
> > that, but other aspects have higher priority in this discussion.
> What aspects would that be?

Be defensive. As far as I can see this callback is only nice to have 
and not
functionally mandatory.

I gave you quite some reasons why this is desired.

Yeah, but I need something to make it necessary. Desired is just not 
enough in this case.

And generally speaking, even if something is not functionally mandatory,
rejecting it *only* because of that isn't a good idea.

Completely agree. But what makes this at least questionable is the 
combination of not mandatory and giving drivers the opportunity to mess 
with submissions again.

Especially in the scheduler and dma_fence handling we had tons of 
patches which added a nice to have and seemingly harmless features which 
later turned into a complete nightmare to maintain.

The takeaway is that we need more big red warning signs in the form of 
documentation and try to not change things just because it makes them 
look better.

It is better to deal with it in terms of content and consider its pros 
and cons.

And in such cases I have to weight between complexity by adding 
something
which might go boom and being conservative and using well known and 
working
code paths.

> Not breaking other drivers? - The callback is optional, and besides 
that, as
> already mentioned, the callback doesn't do anything that can't 
already go wrong
> with the inital credit value from drm_sched_job_init().

During drm_sched_job_init() time the fence of this job isn't 
published yet.
So when the driver specified something invalid we can easily return 
an error
code and abort.

Inside the scheduler we can't do anything like this. E.g. what 
happens if
the driver suddenly returns a value which is to large? We can't 
reject that.

This is all correct and I recognize that. But again, the callback is 
optional.
I don't see how drivers would be affected not opting in for this feature.

I see this actually as cons argument for the change. If many drivers 
would use this then the code path would be much better tested.

And for drivers which do, they'd have the same problem (if you 
actually want to
call it one) doing some driver specific workaround as well. And 
because they'd
not have the correct hook likely many more.

> Keeping things simple? - The workaround PowerVR is considering instead
> (returning a dma-fence in ->prepare_job()) doesn't seem to 
contribute to this
> goal.

I don't see this as a workaround, this is essentially the desired 
solution.
All dependencies of a job should be represented as a dma-fence if 
possible.

So, you're saying that given we have an inaccurate way of tracking 
credits
(because you're rejecting the callback making it accurate), the 
desired solution
would be to use an existing callback with a whole different, more 
complicated
and hence more error prone approach to overcome that?

The background is that dma-fences have a very well defined semantics 
which
through a rather elaborated hack is validated to the extend that 
lockdep can
check if drivers behave correctly or not: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L194

I fail to see what dma-fences have to do with job-flow control of the 
ring
buffer. You mention it to be a plus that there is a hack to be able to 
validate
dma-fences with lockdep, but you miss the fact that the only reason 
we'd need
this in the first place is that you think it is desired to do job-flow 
control
with dma-fences.

Well I consider the hack better than this callback.

The hack Daniel came up with allows us to validate driver behavior. In 
other words even if things seems to work we get a warning if drivers do 
something they are not supposed to do.

With this here you are actually breaking this because now drivers have
influence again on when stuff is scheduled.

I'm not breaking anything. This flow control mechanism is already in the
scheduler, it's just that it's inaccurate because drivers need to 
calculate
every job with the worst case credit amount. I'm just making it 
accurate, such
that it becomes useful for some drivers.

Well, what's wrong with assuming the worst?

I also do not agree that this breaks anything dma-fence related in 
general. It
doesn't give drivers control of *when* stuff is scheduled. It gives the
scheduler an idea of *if* it can schedule a job without overflowing 
the ring
buffer. The influence of when comes through the hardware finishing a 
previous
job and the corresponding dma-fence callback creating the 
corresponding space
in terms of credit capacity of the scheduler.

That's a really good point, but what if the driver keeps telling the 
scheduler that it can't execute the job?

[SNIP]

> However, I don't see why we want to be less accurate as we could be 
and hence
> risk issues in drivers like PowerVR that are hard to debug.
> > AFAICT, whether we could potentially see a ring run dry because 
of this depends
> on the workload pattern submitted by applications. Breaking down 
such observed
> performance issues to the scheduler not being accurate at this 
point in a few
> month or years would be an absurd effort. So why not just rule this 
out in
> advance?

Because correct dma-fence signaling is more important than driver
performance.

You mean because drivers could return more credits than the scheduler has
capacity and then we WARN() but are still stuck?

Yes, exactly that's my concern here.

The easy fix would be to still WARN() but set the job's credits to the 
maximum
credit size the scheduler can handle.

But that has then the potential for overflowing the ring buffer.

As far as I can see when the driver does something wrong there is no way 
for the scheduler to handle that gracefully. Either we don't fulfill the 
driver's wish to have at least X credits available before executing the 
job or we ignore the fact that we need to guarantee forward progress.

Those signaling bugs are even *much* more problematic than anything the
driver can come up with.

Please see here for an example why this is so problematic: 
https://patches.linaro.org/project/linux-media/patch/20200612070623.1778466-1-daniel.vetter@xxxxxxxx/

We basically said that all dependencies the job submission is based 
on is
expressed as a dma-fence, because dma-fences can be proven to be 
correct.

I'm pretty sure that it's also possible to prove three lines of code 
updating
the job's credit count to be correct...

Maybe we're better off using our energy for that instead?

Yeah, those three lines inside the scheduler are indeed correct but I 
can't validate what the driver is doing.

As soon as we have this callback the calculation to come up with the 
credits is outside of the scheduler.

When we add something like this callback we circumvent this whole 
checking.

Maybe we can approach this from a completely different side. 
Basically we
have static and dynamic dependencies for a job.

The static ones are expressed in the dependencies xarray while the 
dynamic
ones are returned by the prepare callback.

What if we completely remove the credits or ring buffer handling from 
the
scheduler and put it into a separate component which drivers can use a
prepare callback?

While I honestly appreciate you making a proposal to move forward, I 
really
think that this makes everything you're concerned about just worse and 
even
starts me getting concerned about it.

1) We'd trade the complxity of three lines of code to update the 
credit count
with a whole new component.

2) You mentioned that the concerns about the prepare callback were 
proven to be
correct, now you propose to extend it even.

The difference is that Daniel has found a way to validate the prepare 
callback functionality.

When the driver gives us a fence which never signals we get a warning, 
when the driver allocates a new fence we get a warning etc...

We have found tons of bugs around this inside the drivers which ranges 
from allocating memory to taking locks in incorrect order. And we 
haven't even fixed all of them yet, I can still point you to code where 
amdgpu for example is still broken. It's just that this source path 
isn't exercised this often so we can live with it for now.

I really don't want to repeat that stunt with this callback again. At 
least not for something which is only nice to have.

The update_job_credits() has a very limited scope and hence is way 
easier to
validate for correctness. The only thing we really need to do is to 
guarantee
forward progress, meaning we just don't allow the job's credits to 
exceed the
schedulers credit capacity. And that should be it.

3) You seem to imply that you want to use dma-fences to do job-flow 
control,
which as I mentioned above has way more complexity (separately 
allocated object,
locking restrictions, etc.) and hence is more error prone.

4) Why would we remove something from the scheduler which can be 
considered to
be one of its core responsibilities?

Yeah, good points. I wanted to at least throw out some ideas how to 
solve this instead of just bluntly rejecting it.

Regards,
Christian.

> > Again, if this callback would introduce any complexity or 
tradeoffs for existing
> drivers, I'd vote to wait until we actually see issues as well. But 
this simply
> isn't the case. And if you think otherwise, please show me the 
complexity it
> introduces that is concerning and the tradeoffs for existing drivers.
> > > > > > > If this here has some measurable positive effect then 
yeah we should
> > > > > > probably do it, but as long as it's only nice to have I 
have some objections
> > > > > > to that.
> > > > > Can't answer this, since Nouveau doesn't support native 
fence waits. However, I
> > > > > guess it depends on how many native fences a job carries. 
So, if we'd have two
> > > > > jobs with each of them carrying a lot of native fences, but 
not a lot of actual
> > > > > work, I can very well imagine that over-accounting can have 
a measureable
> > > > > impact.
> > > > What I can imagine as well is things like the hardware or 
firmware is
> > > > looking at the fullness of the ring buffer to predict how 
much pending work
> > > > there is.
> > > > > > > > > I just wonder if we really want to ask for real 
measurements given that the
> > > > > optimization is rather trivial and cheap and we already 
have enough evidence
> > > > > that it makes sense conceptually.
> > > > Well that's the point I disagree on, this callback isn't 
trivial. If (for
> > > > example) the driver returns a value larger than initially 
estimated for the
> > > > job we can run into an endless loop.
> > > I agree it doesn't make sense to increase, but it wouldn't 
break anything,
> > > unless the job size starts exceeding the capacity of the ring. 
And this case is
> > > handled anyway.
> > > > > > > It's just one more thing which can go boom. At bare 
minimum we should check
> > > > that the value is always decreasing.
> > > Considering the above I still agree, such a check makes sense - 
gonna add one.
> > Please don't send anything out until we have solved this.
> I did, but don't worry about this, it helped me to finalize the 
rest of the
> patch. We can keep discussing this in this thread either way.
> > > So far I haven't seen any argument which would not let me 
reject this and I
> > don't want to waste your time.
> > > > Regards,
> > Christian.
> > > > > - Danilo
> > > > > > > Christian.
> > > > > > > > > - Danilo
> > > > > > > > > > > Regards,
> > > > > > Christian.
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Boris