Re: [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 08.03.23 um 13:39 schrieb Karol Herbst:
On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@xxxxxxx> wrote:
Am 07.03.23 um 15:25 schrieb Asahi Lina:
Some hardware may require more complex resource utilization accounting
than the simple job count supported by drm_sched internally. Add a
can_run_job callback to allow drivers to implement more logic before
deciding whether to run a GPU job.
Well complete NAK.

There hasn't even been any kind of discussion yet you already come
around with a "Well complete NAK"

First, this can be seen as rude behavior and me being part of the drm
community I don't want to have to see this kind of thing.

Obviously, any kind of strong "technical" review point is a nak until
people settle with an agreement on what to land, there is no point in
pointing out a "NAK", especially if that's the first thing you say. If
you want to express your strong disagreement with the proposed
solution, then state what your pain points are directly.

If there is a long discussion and a maintainer feels it's going
nowhere and no conclusion will be reached it might be this kind of
"speaking with authority" point has to be made. But not as the starter
into a discussion. This is unnecessarily hostile towards the
contributor. And I wished we wouldn't have to see this kind of
behavior here.

Yes, some kernel maintainers do this a lot, but kernel maintainers
also have this kind of reputation and people don't want to have to
deal with this nonsense and decide to not contribute at all. So please
just drop this attitude.

Yes, you are completely right with that, but getting this message to the recipient is intentional on my side.

I give completely NAKs when the author of a patch has missed such a fundamental technical connection that further discussion doesn't make sense.

It's not meant to be in any way rude or offending. I can put a smiley behind it if it somehow helps, but we still need a way to raise this big red stop sign.

This is clearly going against the idea of having jobs only depend on
fences and nothing else which is mandatory for correct memory management.

I'm sure it's all documented and there is a design document on how
things have to look like you can point out? Might help to get a better
understanding on how things should be.

Yeah, that's the problematic part. We have documented this very extensively: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences

And both Jason and Daniel gave talks about the underlying problem and try to come up with patches to raise warnings when that happens, but people still keep coming up with the same idea over and over again.

It's just that the technical relationship between preventing jobs from running and with that preventing dma_fences from signaling and the core memory management with page faults and shrinkers waiting for those fences is absolutely not obvious.

We had at least 10 different teams from different companies falling into the same trap already and either the patches were rejected of hand or had to painfully reverted or mitigated later on.

Regards,
Christian.


If the hw is busy with something you need to return the fence for this
from the prepare_job callback so that the scheduler can be notified when
the hw is available again.

Regards,
Christian.

Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
---
   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
   include/drm/gpu_scheduler.h            |  8 ++++++++
   2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4e6ad6e122bc..5c0add2c7546 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
               if (!entity)
                       continue;

+             if (sched->ops->can_run_job) {
+                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+                     if (!sched_job) {
+                             complete_all(&entity->entity_idle);
+                             continue;
+                     }
+                     if (!sched->ops->can_run_job(sched_job))
+                             continue;
+             }
+
               sched_job = drm_sched_entity_pop_job(entity);

               if (!sched_job) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9db9e5e504ee..bd89ea9507b9 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
       struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
                                        struct drm_sched_entity *s_entity);

+     /**
+      * @can_run_job: Called before job execution to check whether the
+      * hardware is free enough to run the job.  This can be used to
+      * implement more complex hardware resource policies than the
+      * hw_submission limit.
+      */
+     bool (*can_run_job)(struct drm_sched_job *sched_job);
+
       /**
            * @run_job: Called to execute the job once all of the dependencies
            * have been resolved.  This may be called multiple times, if





[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux