Am 08.08.19 um 09:29 schrieb Daniel Vetter: > On Thu, Aug 8, 2019 at 9:09 AM Koenig, Christian > <Christian.Koenig@xxxxxxx> wrote: >> Am 07.08.19 um 23:19 schrieb Daniel Vetter: >>> On Wed, Jul 31, 2019 at 10:55:02AM +0200, Daniel Vetter wrote: >>>> On Thu, Jun 27, 2019 at 09:28:11AM +0200, Christian König wrote: >>>>> Hi Daniel, >>>>> >>>>> those fails look like something random to me and not related to my patch >>>>> set. Correct? >>>> First one I looked at has the reservation_obj all over: >>>> >>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cml-u/igt@gem_exec_fence@xxxxxxxxxxxxxxxxxxxxxxx >>>> >>>> So 5 second guees is ... probably real? >>>> >>>> Note that with the entire lmem stuff going on right now there's massive >>>> discussions about how we're doing resv_obj vs obj->mm.lock the wrong way >>>> round in i915, so I'm not surprised at all that you managed to trip this. >>>> >>>> The way I see it right now is that obj->mm.lock needs to be limited to >>>> dealing with the i915 shrinker interactions only, and only for i915 native >>>> objects. And for dma-bufs we need to make sure it's not anywhere in the >>>> callchain. Unfortunately that's a massive refactor I guess ... >>> Thought about this some more, aside from just breaking i915 or waiting >>> until it's refactored (Both not awesome) I think the only option is get >>> back to the original caching. And figure out whether we really need to >>> take the direction into account for that, or whether upgrading to >>> bidirectional unconditionally won't be ok. I think there's only really two >>> cases where this matters: >>> >>> - display drivers using the cma/dma_alloc helpers. Everything is allocated >>> fully coherent, cpu side wc, no flushing. >>> >>> - Everyone else (on platforms where there's actually some flushing going >>> on) is for rendering gpus, and those always map bidirectional and want >>> the mapping cached for as long as possible. >>> >>> With that we could go back to creating the cached mapping at attach time >>> and avoid inflicting the reservation object lock to places that would keel >>> over. >>> >>> Thoughts? >> Actually we had a not so nice internal mail thread with our hardware >> guys and it looks like we have tons of hardware bugs/exceptions that >> sometimes PCIe BARs are only readable or only writable. So it turned out >> that always caching with bidirectional won't work for us either. >> >> Additional to that I'm not sure how i915 actually triggered the issue, >> cause with the current code that shouldn't be possible. >> >> But independent of that I came to the conclusion that we first need to >> get to a common view of what the fences in the reservation mean or >> otherwise the whole stuff here isn't going to work smooth either. >> >> So working on that for now and when that's finished I will come back to >> this problem here again. > Yeah makes sense. I think we also need to clarify a bit the existing > rules around reservatrion_object, dma_fence signaling, and how that > nests with everything else (like memory allocation/fs_reclaim critical > sections, or mmap_sem). > > Ignore the drivers which just pin everything system memory (mostly > just socs) I think we have a bunch of groups, and they're all somewhat > incompatible with each another. Examples: > > - old ttm drivers (anything except amdgpu) nest the mmap_sem within > the reservation_object. That allows you to do copy_*_user while > holding reservations, simplifying command submission since you don't > need fallback paths when you take a fault. But means you have this > awkward trylock in the mmap path with no forward progress guarantee at > all. > > amdgpu fixed that (but left ttm alone), i915 also works like that with > mmap_sem being the outer lock. By the way that is incorrect. Both amdgpu as well as readeon don't use copy_to/from_user while holding the reservation lock. The last time I checked the only driver still doing that was nouveau. Maybe time to add a might_lock() so that we will be informed about misuse by lockdep? Christian. > > - other is reservation_object vs memory allocations. Currently all > drivers assume you can allocate memory while holding a reservation, > but i915 gem folks seem to have some plans to change that for i915. > Which isn't going to work I think, so we need to clarify that before > things get more inconsistent. > > Above two can at least be ensured by adding somme lockdep annotations > and dependency priming, see i915_gem_shrinker_taints_mutex for what I > have in mind for reservation_obj. > > The real pain/scary thing is dma_fence. All the > shrinkers/mmu_notifiers/hmm_mirrors we have assume that you can wait > for a fence from allocation contexts/direct reclaim. Which means > nothing you do between publishing a fence somewhere (dma-buf, syncobj, > syncpt fd) and signalling that fence is allowed to allocate memory or > pull in any dependencies which might need memory allocations. I think > we're mostly ok with this, but there's some i915 patches that break > this. > > Much worse is that lockdep can't help us check this: dma_fence is > essentially struct completion on steroids, and the cross-release > lockdep support for struct completion looks like it's never going to > get merged. So no debugging aids to make sure we get this right, all > we have is review and testing and machines deadlocking in really > complicated ways if we get it wrong. > > Cheers, Daniel _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx