On Wed, May 19, 2021 at 1:24 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> On 18.05.21 at 23:17, Daniel Vetter wrote:
> > [SNIP]
> >> The problem in this case is not starting a new CS, but
> >> synchronizing to the existing ones.
> >>
> >> See, a heavy TLB flush is made completely out of sync. E.g. it
> >> doesn't want to wait for any previous operation.
> >>
> >> In other words imagine the following example:
> >> 1. Both process A and B have a BO mapped.
> >> 2. Process A is heavily using the BO and doing all kinds of
> >>    rendering.
> >> 3. Process B is unmapping the BO.
> >>
> >> Now when process B unmaps the BO it needs to trigger page table
> >> updates and a heavy TLB flush, but since this can take really long
> >> we want to do it asynchronously on the hardware.
> >>
> >> With the current approach you basically can't do that, because you
> >> can't note that a fence should not participate in synchronization
> >> at all.
> >>
> >> E.g. we can't add a fence which doesn't wait for the exclusive one
> >> as shared.
> > Ok I think that's a real problem, and I guess it's also related to
> > all the ttm privatization tricks and all that. So essentially we'd
> > need the opposite of ttm_bo->moving, as in you can't ignore it, but
> > otherwise it completely ignores all the userspace implicit fence
> > stuff.
>
> It goes into that direction, but doesn't sound like the full solution
> either.
>
> [SNIP]
> > Can we please stop with the "amdgpu is right, everyone else is
> > wrong" approach?
>
> Well the approach I'm taking here is not "amdgpu is right, everyone
> else is wrong", but rather that we had DRM uAPI for i915, nouveau and
> radeon and unfortunately leaked that into DMA-buf without much
> thinking about it.
>
> I'm also not saying that the amdgpu approach is right. It's just what
> amdgpu needs in its CS interface.
>
> What I'm saying is that DMA-buf is a device driver independent
> subsystem and we shouldn't bake assumptions that come from just a
> handful of DRM drivers into its implicit sync implementation.
>
> > Like I'm pretty much going to type up the patch that does a full
> > drm subsystem audit of everything and whack amdgpu into compliance.
> > Perf hit be damned, you had a few years to fix this with better
> > uapi. Or I find out that there's a giant inconsistent mess, but at
> > least we'd gain some clarity about where exactly we are here and
> > maybe what to do next.
>
> Ok, to let us move forward please take a look at the first patches of
> the set. They clean up quite a bunch of the mess we have in there
> before even coming to adding flags to the shared slots.
>
> I think you will agree that what we should do is clean up the use
> cases further and separate implicit sync from resource management.

Just replying to this because I'm a bit busy with reviewing everything
we have in upstream right now.

I agree there's some useful stuff in there, but we have a fundamental
disagreement on how this works. That needs to be resolved first, and
as part of that we need to come up with a plan for how to get everyone
on the same page.

Then the next thing is a plan for how to get the various issues you're
raising around dma_resv rules sorted out. Once we have that, and only
then, does it imo make sense to review/merge cleanup patches. As long
as we have fundamental disagreements along the lines we have here,
there's no point.

I should have a patch set maybe tomorrow or early next week with the
results of my drm subsystem review of how exactly dma_resv is used
currently. Thus far it's a few pages of code analysis, but not yet
complete. Also I found some smaller issues in a few places, so the
discussion is going to involve a few more people until we're settled
here :-/

Cheers, Daniel
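A minimal sketch of the corner Christian describes above, against the
2021-era dma_resv API. publish_tlb_flush_fence() is a hypothetical
helper, not real amdgpu code; the point is just that there are only
two slots to publish the TLB-flush fence into, and both say the wrong
thing:

/*
 * Hypothetical helper illustrating the gap: a fence that must be
 * tracked for resource management but must not participate in
 * implicit sync has nowhere to go in the 2021-era dma_resv.
 */
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

static int publish_tlb_flush_fence(struct dma_resv *resv,
                                   struct dma_fence *tlb_flush)
{
        int ret;

        ret = dma_resv_lock(resv, NULL);
        if (ret)
                return ret;

        /*
         * Option 1: the exclusive slot. Wrong, because every later CS
         * that implicitly syncs against this BO now also waits for
         * the TLB flush, which is exactly the oversync we're trying
         * to avoid.
         */
        dma_resv_add_excl_fence(resv, tlb_flush);

        /*
         * Option 2: a shared slot. Also wrong, because the dma_resv
         * contract says a shared fence represents work that waited
         * for the exclusive fence, and the flush deliberately waits
         * for nothing:
         *
         *      ret = dma_resv_reserve_shared(resv, 1);
         *      if (!ret)
         *              dma_resv_add_shared_fence(resv, tlb_flush);
         *
         * What's missing is a third option: "track this fence for
         * resource management only, never for implicit sync".
         */

        dma_resv_unlock(resv);
        return 0;
}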
> In other words we forbid touching the exclusive and shared fences
> directly and have separate APIs for resource management and implicit
> sync.
>
> This makes sense anyway, no matter what implicit synchronization
> framework we will install underneath.
>
> Regards,
> Christian.
>
> > -Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> After that I think we can look at what exact oversync issue
> >>> remains and why, and solve it. But until we have this, it just
> >>> feels like another rehash of "amdgpu insists its own dma_resv
> >>> interpretation is the right one and everyone else should move
> >>> over".
> >>>
> >>> Or maybe I've just become real garbage at reading random driver
> >>> code, wouldn't be the first time :-)
> >>>
> >>> Cheers, Daniel
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Cheers, Daniel
> >>>>>
> >>>>>> --Jason
> >>>>>>
> >>>>>>
> >>>>>>>> That's also the reason the Valve guys came up with a solution
> >>>>>>>> where each BO gets a flag for explicit sync, but that only
> >>>>>>>> works for exports and not for imports.
> >>>>>>>>
> >>>>>>>>> i915 and iirc msm have explicit flags for this, and panfrost
> >>>>>>>>> was designed to support this correctly from the start (also
> >>>>>>>>> with flags I think). That's at least what I remember from
> >>>>>>>>> all the discussions at XDC and #dri-devel, but I didn't
> >>>>>>>>> check the code again to give you the list of uapi flags you
> >>>>>>>>> need for each driver.
> >>>>>>>>>
> >>>>>>>>> The other piece is making sure you're only picking up
> >>>>>>>>> implicit fences when you should, and not any later ones, for
> >>>>>>>>> which Jason has a solution:
> >>>>>>>>>
> >>>>>>>>> https://lore.kernel.org/dri-devel/20210317221940.2146688-1-jason@xxxxxxxxxxxxxx/
> >>>>>>>> Yes, I helped with that as well. But I think that this is
> >>>>>>>> just another workaround without really addressing the
> >>>>>>>> underlying problem.
> >>>>>>>>
> >>>>>>>>> If amdgpu isn't using those, then you will suffer from
> >>>>>>>>> over-synchronization in vulkan and pay a price. The entire
> >>>>>>>>> point of vulkan is that you pick up sync points very
> >>>>>>>>> explicitly, and we also need to have very explicit uapi for
> >>>>>>>>> userspace to pick up/set the implicit fences.
> >>>>>>>>>
> >>>>>>>>> Trying to paper over this with more implicit magic is imo
> >>>>>>>>> just wrong, and definitely not the long term explicit sync
> >>>>>>>>> model we want.
> >>>>>>>> I completely disagree.
> >>>>>>>>
> >>>>>>>> In my opinion the implicit sync model we currently have for
> >>>>>>>> dma_resv is just not well designed at all, since it always
> >>>>>>>> requires cooperation from userspace.
> >>>>>>>>
> >>>>>>>> In other words you need to know when to enable implicit sync
> >>>>>>>> in userspace, and that information is simply not present all
> >>>>>>>> of the time.
> >>>>>>>>
> >>>>>>>> What we have done here is just keep the old reader/writer
> >>>>>>>> flags i915, radeon and nouveau once had and push that out to
> >>>>>>>> everybody else, making the assumption that everybody would
> >>>>>>>> follow it without documenting the actual rules of engagement
> >>>>>>>> you need to follow here.
> >>>>>>>>
> >>>>>>>> That was a really big mistake and we should try to fix that
> >>>>>>>> sooner or later.
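For context, the series from Jason linked above lets userspace
snapshot a dma-buf's current implicit fences as a sync_file up front,
instead of guessing at CS time. A rough userspace sketch; at the time
of this thread it is a proposal rather than merged uapi, so the struct
and ioctl names follow the series and may still change:

/*
 * Snapshot the implicit fences of a dma-buf as a sync_file fd,
 * per the DMA_BUF_IOCTL_EXPORT_SYNC_FILE proposal.
 */
#include <linux/dma-buf.h>
#include <stdbool.h>
#include <sys/ioctl.h>

static int snapshot_implicit_fences(int dmabuf_fd, bool will_write)
{
        struct dma_buf_export_sync_file args = {
                /*
                 * Write access must wait for all fences, read access
                 * only for the (exclusive) write fence.
                 */
                .flags = will_write ? DMA_BUF_SYNC_WRITE : DMA_BUF_SYNC_READ,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                return -1;

        /* A sync_file fd, e.g. for vkImportSemaphoreFdKHR. */
        return args.fd;
}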
> >>>>>>>> The only other clean alternative I see is to use a flag on
> >>>>>>>> the exporter to tell the importer if it should sync to shared
> >>>>>>>> fences or not.
> >>>>>>>>
> >>>>>>>> In addition to that, I'm perfectly fine with implicit sync.
> >>>>>>>> Explicit sync certainly has some use cases as well, but I
> >>>>>>>> don't see it as an absolute advantage over the implicit
> >>>>>>>> model.
> >>>>>>> Ok this stops making sense. Somehow you claim userspace
> >>>>>>> doesn't know when to sync, but somehow the kernel does? By
> >>>>>>> guessing, and getting it wrong mostly, except for the one case
> >>>>>>> that you benchmarked?
> >>>>>>>
> >>>>>>> Aside from silly userspace which exports a buffer to a dma-buf
> >>>>>>> but then never imports it anywhere else, there isn't a case I
> >>>>>>> know of where the kernel actually knows more than userspace.
> >>>>>>> But there are lots of cases where the kernel definitely knows
> >>>>>>> less, especially if userspace doesn't tell it about what's
> >>>>>>> going on with each rendering and buffer.
> >>>>>>>
> >>>>>>> So here are the 2 things you need to make this work like every
> >>>>>>> other driver:
> >>>>>>>
> >>>>>>> 1. A way to set the explicit fence on a buffer. The CS ioctl
> >>>>>>> is perfectly fine, but it can also be separate. Userspace uses
> >>>>>>> this only on a) shared buffers b) when there's a flush/swap on
> >>>>>>> that shared buffer. Not when rendering any of the interim
> >>>>>>> stuff, that only leads to oversync. Anything non-shared is
> >>>>>>> handled explicitly in userspace (at least for modern-ish
> >>>>>>> drivers). This is the only thing that ever sets an exclusive
> >>>>>>> fence (aside from ttm moving buffers around ofc).
> >>>>>>>
> >>>>>>> 2. A way to sync with the implicit fences, either all of them
> >>>>>>> (for upcoming write access) or just the write fence (for read
> >>>>>>> access). At first we thought it's good enough to do this in
> >>>>>>> the CS ioctl, but that's a wee bit too late, hence the patches
> >>>>>>> from Jason. My understanding is that vulkan converts this into
> >>>>>>> a vk syncobj/fence of some sorts, so you really can't make
> >>>>>>> this more explicit and intentional than that.
> >>>>>>>
> >>>>>>> None of this is something the kernel has the slightest idea
> >>>>>>> about when it happens, so you have to have explicit uapi for
> >>>>>>> it. Trying to fake it in the kernel just doesn't work.
> >>>>>>> -Daniel
> >>>>>>> --
> >>>>>>> Daniel Vetter
> >>>>>>> Software Engineer, Intel Corporation
> >>>>>>> http://blog.ffwll.ch
> >>>>> --
> >>>>> Daniel Vetter
> >>>>> Software Engineer, Intel Corporation
> >>>>> http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
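Taken together, points 1 and 2 above amount to roughly the following
flow in a userspace driver. Every drv_* name below is a hypothetical
placeholder, not any real driver's uapi; snapshot_implicit_fences() is
the helper sketched earlier in the thread:

/* Hypothetical driver handles and entry points, for illustration. */
#include <unistd.h>

#define DRV_CS_OUT_SYNC_FILE 0x1

struct drv_ctx;
struct drv_bo { int dmabuf_fd; };

int drv_submit_cs(struct drv_ctx *ctx, unsigned int flags);
void drv_dmabuf_import_write_fence(int dmabuf_fd, int sync_file_fd);
void drv_cs_wait_sync_file(struct drv_ctx *ctx, int sync_file_fd);
int snapshot_implicit_fences(int dmabuf_fd, _Bool will_write);

/* Point 1: only a flush/swap on a *shared* buffer publishes the
 * end-of-rendering fence as the buffer's exclusive fence. */
static void present_shared_buffer(struct drv_ctx *ctx, struct drv_bo *bo)
{
        /* Submit the final CS and get a sync_file fd back. */
        int render_done = drv_submit_cs(ctx, DRV_CS_OUT_SYNC_FILE);

        /* Attach it as the exclusive/write fence, e.g. via an
         * IMPORT_SYNC_FILE-style ioctl. */
        drv_dmabuf_import_write_fence(bo->dmabuf_fd, render_done);
        close(render_done);
}

/* Point 2: before reading a buffer someone else produced, pick up the
 * implicit fences explicitly and queue a wait on them. */
static void acquire_shared_buffer(struct drv_ctx *ctx, struct drv_bo *bo)
{
        /* Read access: only the write fence matters. */
        int fence = snapshot_implicit_fences(bo->dmabuf_fd, 0);

        drv_cs_wait_sync_file(ctx, fence);
        close(fence);
}

/*
 * All the interim rendering in between targets non-shared buffers and
 * uses explicit vk fences/semaphores only, never touching dma_resv;
 * that is what avoids the oversync.
 */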