On Wed, Apr 05, 2023 at 04:14:11PM +0200, Christian König wrote:
> Am 05.04.23 um 15:40 schrieb Daniel Vetter:
> > On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
> > > Some hardware may require more complex resource utilization accounting
> > > than the simple job count supported by drm_sched internally. Add a
> > > can_run_job callback to allow drivers to implement more logic before
> > > deciding whether to run a GPU job.
> > >
> > > Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
> >
> > Ok, scheduler rules, or rather an attempt to summarize the entire
> > discussion:
> >
> > dma_fence rules are very tricky. The two main chapters in the docs are
> >
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#dma-fence-cross-driver-contract
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> >
> > Unfortunately I don't think it's possible to check this at compile time;
> > so far all we can do is validate at runtime. I've posted two patches for
> > this:
> >
> > https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@xxxxxxxx/
> > https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@xxxxxxxx/
> >
> > Unfortunately most drivers are buggy and get this completely wrong, so
> > realistically we'd need to make this a per-driver opt-out and annotate
> > all current drivers. Well, except amdgpu, which I think is correct by
> > now (they'd still need to test that).
>
> There is still one potential memory allocation in the run_job callback in
> amdgpu which I wasn't able to fix yet.
>
> But that one is purely academic and could trivially be replaced with
> GFP_ATOMIC if we ever have to.

I think the modeset in the tdr code was scarier, and I'm not sure you
really managed to get rid of absolutely everything in there yet.
-Daniel

> Christian.
>
> > And Rob Clark is working on patches to fix up msm.
> >
> > I think the best plan here is for you to work together with Rob to make
> > sure these annotations are mandatory for any Rust drivers (I don't want
> > new buggy drivers, at least). It would also be great to improve the
> > kerneldoc for all the driver hooks to explain these restrictions and
> > link to the relevant kerneldocs (there's also one for the dma_fence
> > signalling annotations, which might be worth linking too).
> >
> > I don't see any way to make this explicit in Rust types; it's really
> > only something runtime tests (using lockdep) can catch. Somewhat
> > disappointing.
> >
> > For the other things discussed here:
> >
> > - Option<Dma_Fence> as the return value for ->prepare_job makes sense
> >   to me.
> >
> > - I don't see any way a driver can use ->can_run_job without breaking
> >   the above rules; that really doesn't sound like a good idea to me.
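
To make the lockdep point concrete, here's a rough sketch of what the
signalling annotations look like around a driver's ->run_job() path. This
is hypothetical foo_* driver code, not from any patch in this thread; only
dma_fence_begin_signalling()/dma_fence_end_signalling() are real, from the
annotation work linked above:

static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = to_foo_job(sched_job);	/* assumed helper */
	bool cookie;

	/*
	 * Everything between begin and end is a fence signalling critical
	 * section: with CONFIG_PROVE_LOCKING, lockdep will flag GFP_KERNEL
	 * allocations, dma_resv locking, blocking on userspace, and so on.
	 */
	cookie = dma_fence_begin_signalling();

	/*
	 * If an allocation in here is truly unavoidable, GFP_ATOMIC (or
	 * preallocating at job creation time) is the only safe option.
	 */
	foo_hw_submit(job);	/* assumed: pushes the job to the hardware */

	dma_fence_end_signalling(cookie);

	return dma_fence_get(job->done_fence);
}

That turns a rule violation into a lockdep splat during testing instead of
a once-in-a-blue-moon deadlock in production.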
> >
> > Cheers, Daniel
> >
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> > >  include/drm/gpu_scheduler.h            |  8 ++++++++
> > >  2 files changed, 18 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 4e6ad6e122bc..5c0add2c7546 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> > >  		if (!entity)
> > >  			continue;
> > >
> > > +		if (sched->ops->can_run_job) {
> > > +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > > +			if (!sched_job) {
> > > +				complete_all(&entity->entity_idle);
> > > +				continue;
> > > +			}
> > > +			if (!sched->ops->can_run_job(sched_job))
> > > +				continue;
> > > +		}
> > > +
> > >  		sched_job = drm_sched_entity_pop_job(entity);
> > >
> > >  		if (!sched_job) {
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 9db9e5e504ee..bd89ea9507b9 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> > >  	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> > >  					 struct drm_sched_entity *s_entity);
> > >
> > > +	/**
> > > +	 * @can_run_job: Called before job execution to check whether the
> > > +	 * hardware is free enough to run the job. This can be used to
> > > +	 * implement more complex hardware resource policies than the
> > > +	 * hw_submission limit.
> > > +	 */
> > > +	bool (*can_run_job)(struct drm_sched_job *sched_job);
> > > +
> > >  	/**
> > >  	 * @run_job: Called to execute the job once all of the dependencies
> > >  	 * have been resolved. This may be called multiple times, if
> > >
> > > --
> > > 2.35.1
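
And to illustrate the ->prepare_job alternative: the same throttling can be
expressed as a dependency fence instead of a new callback. Again purely
hypothetical foo_* code, not part of this patch:

static struct dma_fence *foo_prepare_job(struct drm_sched_job *sched_job,
					 struct drm_sched_entity *s_entity)
{
	/* assumed helper: the driver device embedding the scheduler */
	struct foo_device *fdev = to_foo_device(sched_job->sched);

	/*
	 * No free hardware slot: hand the scheduler a fence that the
	 * driver signals from its irq handler (no allocations there!)
	 * once a slot frees up. The scheduler waits on it and only then
	 * calls ->run_job().
	 */
	if (!foo_hw_has_free_slot(fdev))
		return dma_fence_get(foo_next_slot_fence(fdev));

	return NULL;	/* nothing to wait for, the job can run right away */
}

The upside is that the throttling point is a fence like any other
dependency, so it stays within the normal dma_fence rules instead of adding
a new place where they can be broken.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch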