On Wed, Mar 8, 2023 at 4:09 PM Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 08.03.23 um 15:43 schrieb Karol Herbst:
> > [SNIP]
> > "further"? There was no discussion at all,
>
> Yeah, well that is exactly what I wanted to achieve.
>
> > you just started off like
> > that. If you think somebody misses that connection, you can point out
> > to documentation/videos whatever so the contributor can understand
> > what's wrong with an approach. You did that, so that's fine. It's just
> > that starting off _any_ discussion with a "Well complete NAK" is terrible
> > style. I'd feel uncomfortable if that happened to me and I'm sure
> > there are enough people like that that we should be more reasonable
> > with our replies. Just... don't.
> >
> > We are all humans here and people react negatively to such things. And
> > if people do it on purpose it just makes it worse.
>
> I completely see your point, I just don't know how to improve it.
>
> I don't stop people like this because I want to make them uncomfortable
> but because I want to prevent further discussions on that topic.
>
> In other words how can I make people notice that this is something
> fundamental while still being polite?
>

I think a little improvement over this would be to at least wait a few
replies before resorting to those strong statements, that is, just
before it becomes a risk of simply wasting time.

> >>>> This is clearly going against the idea of having jobs only depend on
> >>>> fences and nothing else which is mandatory for correct memory management.
> >>>>
> >>> I'm sure it's all documented and there is a design document on how
> >>> things have to look that you can point out? Might help to get a better
> >>> understanding on how things should be.
> >> Yeah, that's the problematic part. We have documented this very
> >> extensively:
> >> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
> >>
> >> And both Jason and Daniel gave talks about the underlying problem and
> > fyi:
> > s/Jason/Faith/g
>
> +1. I wasn't aware of that.
>
> >> try to come up with patches to raise warnings when that happens, but
> >> people still keep coming up with the same idea over and over again.
> >>
> > Yes, and we'll have to tell them over and over again. Nothing wrong
> > with that. That's just part of maintaining such a big subsystem. And
> > that's definitely not a valid reason to phrase things like above.
> >
> >> It's just that the technical relationship between preventing jobs from
> >> running and with that preventing dma_fences from signaling and the core
> >> memory management with page faults and shrinkers waiting for those
> >> fences is absolutely not obvious.
> >>
> >> We had at least 10 different teams from different companies falling into
> >> the same trap already and either the patches were rejected out of hand or
> >> had to be painfully reverted or mitigated later on.
> >>
> > Sure, but that's just part of the job. And pointing out fundamental
> > mistakes early on is important, but the situation won't get any better
> > by being like that. Yes, we'll have to repeat the same words over and
> > over again, and yes that might be annoying, but that's just how it is.
>
> Well I have no problem explaining to people why a solution doesn't work.
>
> But what usually happens is that people don't realize that they need to
> back off from a design and completely start over.
>
> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
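To illustrate the rule referenced above, that jobs must only depend on
dma_fences and nothing else, here is a minimal sketch of how a "hardware
is too busy" condition can be expressed as a fence dependency at job
build time. This is not taken from the patch below: my_device and
my_hw_queue_busy_fence() are hypothetical driver pieces, and only
drm_sched_job_add_dependency() is existing scheduler API.

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

struct my_device;

/*
 * Hypothetical driver helper: returns a fence that signals once a
 * hardware slot frees up, or NULL if there is room right now.
 */
struct dma_fence *my_hw_queue_busy_fence(struct my_device *mdev);

static int my_job_add_hw_dependency(struct my_device *mdev,
                                    struct drm_sched_job *job)
{
        struct dma_fence *busy = my_hw_queue_busy_fence(mdev);

        if (!busy)
                return 0;       /* hardware has room, nothing to wait for */

        /*
         * The scheduler waits for all dependencies before calling
         * run_job(); the fence reference is taken over by this call.
         */
        return drm_sched_job_add_dependency(job, busy);
}

Because the "busy" condition is now a fence, anything in memory
management that ends up waiting on the job's fences can still be
reasoned about through the usual dma_fence signaling rules.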
> >>
> >>>> If the hw is busy with something you need to return the fence for this
> >>>> from the prepare_job callback so that the scheduler can be notified when
> >>>> the hw is available again.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
> >>>>> ---
> >>>>>   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >>>>>   include/drm/gpu_scheduler.h            |  8 ++++++++
> >>>>>   2 files changed, 18 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> index 4e6ad6e122bc..5c0add2c7546 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >>>>>                 if (!entity)
> >>>>>                         continue;
> >>>>>
> >>>>> +               if (sched->ops->can_run_job) {
> >>>>> +                       sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> >>>>> +                       if (!sched_job) {
> >>>>> +                               complete_all(&entity->entity_idle);
> >>>>> +                               continue;
> >>>>> +                       }
> >>>>> +                       if (!sched->ops->can_run_job(sched_job))
> >>>>> +                               continue;
> >>>>> +               }
> >>>>> +
> >>>>>                 sched_job = drm_sched_entity_pop_job(entity);
> >>>>>
> >>>>>                 if (!sched_job) {
> >>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>>>> index 9db9e5e504ee..bd89ea9507b9 100644
> >>>>> --- a/include/drm/gpu_scheduler.h
> >>>>> +++ b/include/drm/gpu_scheduler.h
> >>>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >>>>>         struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >>>>>                                          struct drm_sched_entity *s_entity);
> >>>>>
> >>>>> +       /**
> >>>>> +        * @can_run_job: Called before job execution to check whether the
> >>>>> +        * hardware is free enough to run the job. This can be used to
> >>>>> +        * implement more complex hardware resource policies than the
> >>>>> +        * hw_submission limit.
> >>>>> +        */
> >>>>> +       bool (*can_run_job)(struct drm_sched_job *sched_job);
> >>>>> +
> >>>>>         /**
> >>>>>          * @run_job: Called to execute the job once all of the dependencies
> >>>>>          *          have been resolved. This may be called multiple times, if
> >>>>>
>
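For completeness, a minimal sketch of the prepare_job route suggested in
the quoted review, reusing the same hypothetical my_hw_queue_busy_fence()
helper as in the earlier sketch. The callback signature is the one
visible in gpu_scheduler.h above; to_my_device() is likewise a made-up
driver helper.

/*
 * Hedged sketch of the prepare_job route: instead of a can_run_job()
 * boolean, hand the scheduler a fence it can wait on. prepare_job may
 * be called several times while the job is being considered; returning
 * NULL means there is nothing left to wait for.
 */
static struct dma_fence *
my_sched_prepare_job(struct drm_sched_job *sched_job,
                     struct drm_sched_entity *s_entity)
{
        struct my_device *mdev = to_my_device(sched_job->sched);

        /*
         * NULL when the hardware has room; otherwise a fence that
         * signals once a slot frees up, after which the scheduler
         * proceeds to run_job() on its own.
         */
        return my_hw_queue_busy_fence(mdev);
}

Hooked up as .prepare_job in the driver's drm_sched_backend_ops, the
scheduler waits on the returned fence and runs the job once it signals,
so no extra can_run_job() check is needed in drm_sched_main().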