Re: [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback

Christian König <christian.koenig@xxxxxxx> · Thu, 9 Mar 2023 09:05:57 +0100

Am 09.03.23 um 07:30 schrieb Asahi Lina:
On 09/03/2023 05.14, Christian König wrote:
I think you mean wake_up_interruptible(). That would be
drm_sched_job_done(), on the fence callback when a job completes, which
as I keep saying is the same logic used for
hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:

   * wake_up() has to be called after changing any variable that could
   * change the result of the wait condition.

So what you essentially try to do here is to skip that and say
drm_sched_job_done() would call that anyway, but when you read any
variable to determine that state then as far as I can see nothing is
guarantying that order.
The driver needs to guarantee that any changes to that state precede a
job completion fence signal of course, that's the entire idea of the
API. It's supposed to represent a check for per-scheduler (or more
specific, but not more global) resources that are released on job
completion. Of course if you misuse the API you could cause a problem,
but what I'm trying to say is that the API as designed and when used as
intended does work properly.

Put another way: job completions always need to cause the sched main
loop to run an iteration anyway (otherwise we wouldn't make forward
progress), and job completions are exactly the signal that the
can_run_job() condition may have changed.

The only other possibility how you could use the callback correctly
would be to call drm_fence_is_signaled() to query the state of your hw
submission from the same fence which is then signaled. But then the
question is once more why you don't give that fence directly to the
scheduler?
But the driver is supposed to guarantee that the ordering is always 1.
resources freed, 2. fence signaled. So you don't need to check for the
fence, you can just check for the resource state.

Yeah, but this is exactly what the dma_fence framework tried to prevent. 
We try very hard to avoid such side channel signaling :)

But putting that issue aside for a moment. What I don't get is when you 
have such intra queue dependencies, then why can't you check that at a 
much higher level?

In other words even userspace should be able to predict that for it's 
submissions X amount of resources are needed and when all of my 
submissions run in parallel that won't work.

Asking the firmware for a status is usually a magnitudes slower than 
just computing it before submission.

Regards,
Christian.

If the callback
returns false then by definition the fence wasn't yet signaled at some
point during its execution (because the resources weren't yet freed),
and since it would be in the wait_event_interruptible() check path, by
definition the fence signaling at any point during or after the check
would cause the thread to wake up again and re-check.

Thread 1                                          Thread 2
1. wait_event_interruptible() arms wq             1. Free resources
2. can_run_job() checks resources                 2. Signal fence
3. wait_event_interruptible() sleeps on wq        3. Fence wakes up wq
4. loop

There is no possible interleaving of those sequences that leads to a
lost event and the thread not waking up:
- If T2.3 happens before T1.1, that means T2.1 happened earlier and T1.2
must return true.
- If T2.3 happens after T1.1 but before T1.3, the wq code will ensure
the wq does not sleep (or immediately wakes up) at T1.3 since it was
signaled during the condition check, after the wq was armed. At the next
check loop, T1.2 will then return true, since T2.1 already happened
before T2.3.
- If T2.3 happens during T1.3, the wq wakes up normally and does another
check, and at that point T1.2 returns true.

QED.

~~ Lina