On Thu, Jan 14, 2021 at 10:26 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > On Thu, Jan 14, 2021 at 4:27 AM Jerome Glisse <jglisse@xxxxxxxxxx> wrote: > > > > On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote: > > > On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse <jglisse@xxxxxxxxxx> wrote: > > > > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote: > > > > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote: > > > > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter: > > > > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote: > > > > > > >> This is the first version of our HMM based shared virtual memory manager > > > > > > >> for KFD. There are still a number of known issues that we're working through > > > > > > >> (see below). This will likely lead to some pretty significant changes in > > > > > > >> MMU notifier handling and locking on the migration code paths. So don't > > > > > > >> get hung up on those details yet. > > > > > > >> > > > > > > >> But I think this is a good time to start getting feedback. We're pretty > > > > > > >> confident about the ioctl API, which is both simple and extensible for the > > > > > > >> future. (see patches 4,16) The user mode side of the API can be found here: > > > > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c > > > > > > >> > > > > > > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM > > > > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25), > > > > > > >> and some retry IRQ handling changes (32). > > > > > > >> > > > > > > >> > > > > > > >> Known issues: > > > > > > >> * won't work with IOMMU enabled, we need to dma_map all pages properly > > > > > > >> * still working on some race conditions and random bugs > > > > > > >> * performance is not great yet > > > > > > > Still catching up, but I think there's another one for your list: > > > > > > > > > > > > > > * hmm gpu context preempt vs page fault handling. I've had a short > > > > > > > discussion about this one with Christian before the holidays, and also > > > > > > > some private chats with Jerome. It's nasty since no easy fix, much less > > > > > > > a good idea what's the best approach here. > > > > > > > > > > > > Do you have a pointer to that discussion or any more details? > > > > > > > > > > Essentially if you're handling an hmm page fault from the gpu, you can > > > > > deadlock by calling dma_fence_wait on a (chain of, possibly) other command > > > > > submissions or compute contexts with dma_fence_wait. Which deadlocks if > > > > > you can't preempt while you have that page fault pending. Two solutions: > > > > > > > > > > - your hw can (at least for compute ctx) preempt even when a page fault is > > > > > pending > > > > > > > > > > - lots of screaming in trying to come up with an alternate solution. They > > > > > all suck. > > > > > > > > > > Note that the dma_fence_wait is hard requirement, because we need that for > > > > > mmu notifiers and shrinkers, disallowing that would disable dynamic memory > > > > > management. Which is the current "ttm is self-limited to 50% of system > > > > > memory" limitation Christian is trying to lift. So that's really not > > > > > a restriction we can lift, at least not in upstream where we need to also > > > > > support old style hardware which doesn't have page fault support and > > > > > really has no other option to handle memory management than > > > > > dma_fence_wait. > > > > > > > > > > Thread was here: > > > > > > > > > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@xxxxxxxxxxxxxx/ > > > > > > > > > > There's a few ways to resolve this (without having preempt-capable > > > > > hardware), but they're all supremely nasty. > > > > > -Daniel > > > > > > > > > > > > > I had a new idea, i wanted to think more about it but have not yet, > > > > anyway here it is. Adding a new callback to dma fence which ask the > > > > question can it dead lock ? Any time a GPU driver has pending page > > > > fault (ie something calling into the mm) it answer yes, otherwise > > > > no. The GPU shrinker would ask the question before waiting on any > > > > dma-fence and back of if it gets yes. Shrinker can still try many > > > > dma buf object for which it does not get a yes on associated fence. > > > > > > Having that answer on a given fence isn't enough, you still need to > > > forward that information through the entire dependency graph, across > > > drivers. That's the hard part, since that dependency graph is very > > > implicit in the code, and we'd need to first roll it out across all > > > drivers. > > > > Here i am saying do not wait on fence for which you are not sure. > > Only wait on fence for which you are 100% certain you can not dead > > lock. So if you can never be sure on dma fence then never wait on > > dma-fence in the shrinker. However most driver should have enough > > information in their shrinker to know if it is safe to wait on > > fence internal to their device driver (and also know if any of > > those fence has implicit outside dependency). So first implementation > > would be to say always deadlock and then having each driver build > > confidence into what it can ascertain. > > I just don't think that actually works in practice: > > - on a single gpu you can't wait for vk/gl due to shared CUs, so only > sdma and uvd are left (or whatever else pure fixed function) > > - for multi-gpu you get the guessing game of what leaks across gpus > and what doesn't. With p2p dma-buf we're now leaking dma_fence across > gpus even when there's no implicit syncing by userspace (although for > amdgpu this is tricky since iirc it still lacks the flag to let > userspace decide this, so this is more for other drivers). > > - you don't just need to guarantee that there's no dma_fence > dependency going back to you, you also need to make sure there's no > other depedency chain through locks or whatever that closes the loop. > And since your proposal here is against the dma_fence lockdep > annotations we have now, lockdep won't help you (and let's be honest, > review doesn't catch this stuff either, so it's up to hangs in > production to catch this stuff) > > - you still need the full dependency graph within the driver, and only > i915 scheduler has that afaik. And I'm not sure implementing that was > a bright idea > > - assuming it's a deadlock by default means all gl/vk memory is > pinned. That's not nice, plus in additional you need hacks like ttm's > "max 50% of system memory" to paper over the worst fallout, which > Christian is trying to lift. I really do think we need to be able to > move towards more dynamic memory management, not less. Forgot one issue: - somehow you need to transport the knowledge that you're in the gpu fault repair path of a specific engine down to shrinkers/mmu notifiers and all that. And it needs to be fairly specific, otherwise it just amounts again to "no more dma_fence_wait allowed". -Daniel > So in the end you're essentially disabling shrinking/eviction of other > gpu tasks, and I don't think that works. I really think the only two > realistic options are > - guarantee forward progress of other dma_fence (hw preemption, > reserved CUs, or whatever else you have) > - guarantee there's not a single offending dma_fence active in the > system that could cause problems > > Hand-waving that in theory we could track the dependencies and that in > theory we could do some deadlock avoidance of some sorts about that > just doesn't look like a pragmatic&practical solution to me here. It > feels about as realistic as just creating a completely new memory > management model that sidesteps the entire dma_fence issues we have > due to mixing up kernel memory management and userspace sync fences in > one thing. > > Cheers, Daniel > > > > > This does not solve the mmu notifier case, for this you would just > > > > invalidate the gem userptr object (with a flag but not releasing the > > > > page refcount) but you would not wait for the GPU (ie no dma fence > > > > wait in that code path anymore). The userptr API never really made > > > > the contract that it will always be in sync with the mm view of the > > > > world so if different page get remapped to same virtual address > > > > while GPU is still working with the old pages it should not be an > > > > issue (it would not be in our usage of userptr for compositor and > > > > what not). > > > > > > > > Maybe i overlook something there. > > > > > > tbh I'm never really clear on how much exactly we need, and whether > > > maybe the new pin/unpin api should fix it all. > > > > pin/unpin is not a solution it is to fix something with GUP (where > > we need to know if a page is GUPed or not). GUP should die longterm > > so anything using GUP (pin/unpin falls into that) should die longterm. > > Pining memory is bad period (it just breaks too much mm and it is > > unsolvable for things like mremap, splice, ...). > > > > Cheers, > > Jérôme > > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel