On Thu, Jun 24, 2021 at 7:56 PM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> On 24.06.21 at 19:43, Daniel Vetter wrote:
> > On Thu, Jun 24, 2021 at 7:38 PM Christian König
> > <christian.koenig@xxxxxxx> wrote:
> >> On 24.06.21 at 19:29, Daniel Vetter wrote:
> >>> On Thu, Jun 24, 2021 at 07:03:10PM +0200, Christian König wrote:
> >>>> On 24.06.21 at 16:00, Daniel Vetter wrote:
> >>>>> This is essentially part of drm_sched_dependency_optimized(), which
> >>>>> only amdgpu seems to make use of. Use it a bit more.
> >>>>>
> >>>>> Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> >>>>> Cc: "Christian König" <christian.koenig@xxxxxxx>
> >>>>> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> >>>>> Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> >>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> >>>>> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> >>>>> Cc: Jack Zhang <Jack.Zhang1@xxxxxxx>
> >>>>> ---
> >>>>>   drivers/gpu/drm/scheduler/sched_main.c | 7 +++++++
> >>>>>   1 file changed, 7 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> index 370c336d383f..c31d7cf7df74 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> @@ -649,6 +649,13 @@ int drm_sched_job_await_fence(struct drm_sched_job *job,
> >>>>>           if (!fence)
> >>>>>                   return 0;
> >>>>> +       /* if it's a fence from us it's guaranteed to be earlier */
> >>>>> +       if (fence->context == job->entity->fence_context ||
> >>>>> +           fence->context == job->entity->fence_context + 1) {
> >>>>> +               dma_fence_put(fence);
> >>>>> +               return 0;
> >>>>> +       }
> >>>>> +
> >>>> Well NAK. That would break Vulkan.
> > I'm assuming your reply means the NAK is retracted and was just the
> > usual "this doesn't perfectly fit for amdgpu" reflex?
>
> Well rather "NAK, you haven't considered that special handling in amdgpu
> and if you really want to unify this you need that as well."
>
> >
> >>>> The problem is that Vulkan can insert dependencies between jobs which run on
> >>>> the same queue.
> >>>>
> >>>> So we need to track those as well and if the previous job for the same
> >>>> queue/scheduler is not yet finished a pipeline synchronization needs to be
> >>>> inserted.
> >>>>
> >>>> That's one of the reasons we weren't able to unify the dependency handling
> >>>> yet.
> >>> That sounds like an extremely amdgpu specific constraint?
> >> Yeah, that's totally hardware specific.
> >>
> >> It's just that I don't know how else we could track that without having
> >> the same separation as in amdgpu between implicit and explicit fences.
> >> And as far as I understand it that's exactly what you want to avoid.
> >>
> >> As I said this turned out to be really awkward.
> >>
> >>> You're also the
> >>> only one who keeps track of whether the previous job we've scheduled has
> >>> finished already (I guess they can get pipelined and you don't flush by
> >>> default), so you insert fences.
> >> Yes, exactly that.
> >>
> >>> I guess we can add an await_fence_no_dedup or so for amdgpu, but I'm not
> >>> sure why we have to inflict this design constraint on all other drivers?
> >>> At least I'm not seeing anything in lima, panfrost, v3d or etnaviv that
> >>> would break with this, and i915 will also be perfectly fine.
> >>>
> >>> Also note: I'm not using this for amdgpu, exactly because there's a few
> >>> funny things going on.
> >> Yeah, exactly the reason why we never unified this.
> > Yeah there's clear limits to this, because you also can't use the
> > await_implicit helper, because you have to keep filtering for owner or
> > the current amdgpu uapi goes horribly slow. I think the benefit would
> > be just that we could share the datastructure and the book-keeping,
> > but aside from that you'd need your own integration in amdgpu.
>
> Yeah, but that is trivial. The _add_dependency() function (or however we
> want to call it) needs to be exported anyway for adding fences from
> syncfile and syncobj.
>
> Or do you also want to unify the handling for those?

I guess we could add some convenience wrapper that pulls in a sync_file
or syncobj automatically. But there's not that much code involved there,
and it's also not tricky. Also drivers might need to add dependencies
for whatever anyway.

The await_implicit helper is a bit different, because that defines how
implicit sync is supposed to work.

I guess the bikeshed then boils down to which one gets to be the simple
await_fence() function: the one that filters for the same timeline, or
the one that doesn't. I'd make the non-filtering one the special case so
that amdgpu sticks out a bit more - out of the 6 drivers with schedulers
(i915 included) it seems to be the special one.

> > One idea I just had was whether we could use the tag bits xarray has
> > for the amdgpu purpose. Like we could do a
> > drm_sched_job_await_fence_tagged, where you supply additional
> > information (like the "this might be relevant for the vm_flush" and
> > things like that). Afaiui xarray tags are very fast to enumerate on if
> > you're looking for specific tags, but I might be wrong. Ideally this
> > would avoid the need for the duplicated amdgpu_job->sched.
>
> That could work.
>
> Essentially we just need the information from the scheduler about which
> is the last fence that was dependency optimized.
>
> In other words, when you push jobs like these to the same scheduler:
>
> J1
> J2 -> depends on J1.
> J3 -> depends on whatever, but not J2
>
> The hardware needs to insert a flush between J2 and J1, but not between
> J3 and J2.
>
> This makes roughly a 19% performance difference for some OpenGL games,
> and causes incorrect rendering for Vulkan, if you mess it up one way or
> the other.

Yeah, that's massive. On i915 the penalty for "too many pipeline stalls",
even within batches, is a lot smaller, so we never bothered with this at
all.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Cheers, Daniel
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> Finally: You _really_ need explicit dependency handling for Vulkan in your
> >>> uapi, instead of the kernel second-guessing what userspace might be doing.
> >>> That's really not how Vulkan is designed to work :-)
> >>> Cheers, Daniel
> >>>
> >>>
> >>>> Christian.
> >>>>
> >>>>>           /* Deduplicate if we already depend on a fence from the same context.
> >>>>>            * This lets the size of the array of deps scale with the number of
> >>>>>            * engines involved, rather than the number of BOs.
>
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
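
A rough sketch of the same-ring check described in the J1/J2/J3 example
above, purely for illustration: the example_* names below are invented,
this is not code from the patch or from amdgpu, and only
dma_fence_is_signaled() and the fence->context field are actual kernel
API.

#include <linux/dma-fence.h>

/*
 * Illustrative only: decide whether a dependency needs a pipeline flush
 * on amdgpu-like hardware, assuming the scheduler core already drops
 * same-timeline fences and the driver therefore tracks this dependency
 * itself. A dependency on the *same* ring (J2 -> J1) that has not
 * signaled yet needs a flush in the command stream, while cross-ring
 * dependencies (J3) are simply waited on by the scheduler.
 */
static bool example_needs_pipeline_sync(struct dma_fence *dep,
                                        u64 ring_fence_context)
{
        /* Dependencies on other rings are handled by the scheduler. */
        if (!dep || dep->context != ring_fence_context)
                return false;

        /* Same ring and possibly still in flight: insert a flush. */
        return !dma_fence_is_signaled(dep);
}

Whether the scheduler would hand that last same-ring fence back to the
driver directly, or via something like the drm_sched_job_await_fence_tagged
idea floated above, is the open question in this thread.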