On Thu, Dec 17, 2020 at 7:09 PM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>
> Adding a few folks on cc just to raise awareness and so that I can
> get corrected if I said anything wrong.
>
> On Thu, Dec 17, 2020 at 04:45:55PM +0100, Daniel Vetter wrote:
> > On Thu, Dec 17, 2020 at 4:36 PM Christian König
> > <christian.koenig@xxxxxxx> wrote:
> > > On 17.12.20 at 16:26, Daniel Vetter wrote:
> > > > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > > > <christian.koenig@xxxxxxx> wrote:
> > > >> On 17.12.20 at 15:36, Daniel Vetter wrote:
> > > >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > > >>> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> > > >>>> On 16.12.20 at 16:09, Daniel Vetter wrote:
> > > >>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> > > >>>>> [SNIP]
> > > >>>>>> +
> > > >>>>>> +/* As long as pages are available make sure to release at least one */
> > > >>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> > > >>>>>> +					  struct shrink_control *sc)
> > > >>>>>> +{
> > > >>>>>> +	struct ttm_operation_ctx ctx = {
> > > >>>>>> +		.no_wait_gpu = true
> > > >>>>> Iirc there's an eventual shrinker limit where it gets desperate. I think
> > > >>>>> once we hit that, we should allow gpu waits. But it's not passed to
> > > >>>>> shrinkers for reasons, so maybe we should have a second round that tries
> > > >>>>> to more actively shrink objects if we fell substantially short of what
> > > >>>>> reclaim expected us to do?
> > > >>>> I think we should try to avoid waiting for the GPU in the shrinker
> > > >>>> callback.
> > > >>>>
> > > >>>> When we get HMM we will have cases where the shrinker is called from
> > > >>>> there and we can't wait for the GPU then without causing deadlocks.
> > > >>> Uh, that doesn't work. Also, the current rules are that you are allowed
> > > >>> to call dma_fence_wait from shrinker callbacks, so that ship has sailed
> > > >>> already. This is because shrinkers are a less restrictive context than
> > > >>> mmu notifier invalidation, and we wait in there too.
> > > >>>
> > > >>> So if you can't wait in shrinkers, you also can't wait in mmu
> > > >>> notifiers (and also not in HMM, which is the same thing). Why do you
> > > >>> need this?
> > > >> The core concept of HMM is that pages are faulted in on demand and it is
> > > >> perfectly valid for one of those pages to be on disk.
> > > >>
> > > >> So when a page fault happens we might need to be able to allocate memory
> > > >> and fetch something from disk to handle that.
> > > >>
> > > >> When this memory allocation then in turn waits for the GPU which is
> > > >> running the HMM process, we are pretty much busted.
> > > > Yeah, you can't do that. That's the entire infinite fences discussion.
> > >
> > > Yes, exactly.
> > >
> > > > For HMM to work, we need to stop using dma_fence for userspace sync,
> > >
> > > I was considering separating that into a dma_fence and an hmm_fence.
> > > Or something like this.
> >
> > The trouble is that dma_fence in all its forms is uapi. And on gpus
> > without page fault support dma_fence_wait is still required in
> > allocation contexts. So creating a new kernel structure doesn't really
> > solve anything, I think; it needs entirely new uapi completely decoupled
> > from memory management. The last time we did new uapi was probably
> > modifiers, and that's still not rolled out years later.
>
> With hmm there should not be any fence! You do not need them.
> If you feel you need them then you are doing something horribly
> wrong.
>
> See below on what HMM needs and what it means.
>
> > > > and you can only use the amdkfd style preempt fences. And preempting
> > > > while the pagefault is pending is, I thought, something we require.
> > >
> > > Yeah, the problem is that most hardware can't do that :)
> > >
> > > Getting page faults to work is hard enough, preempting while waiting for
> > > a fault to return is not something which was anticipated :)
> >
> > Hm, last summer in a thread you said you've blocked that because it
> > doesn't work. I agreed, page fault without preempt is rather tough to
> > make work.
> >
> > > > Iow, the HMM page fault handler must not be a dma-fence critical
> > > > section, i.e. it's not allowed to hold up any dma_fence, ever.
> > >
> > > What do you mean with that?
> >
> > dma_fence_begin/end_signalling() annotations essentially, i.e.
> > cross-release dependencies. Or the other way round: if you want to be
> > able to allocate memory, you have to guarantee that you're never
> > holding up a dma_fence.
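Aside for the folks newly on cc: "annotations" here means the
dma_fence_begin/end_signalling() lockdep markers. A minimal sketch of
how a driver wraps its fence-signalling path in them; illustrative
only, struct driver_job and the hardware kick are made up:

#include <linux/dma-fence.h>

struct driver_job {			/* made-up example structure */
	struct dma_fence *done_fence;	/* fence others may wait on */
};

static void driver_run_job(struct driver_job *job)
{
	bool cookie;

	/*
	 * Everything from here to dma_fence_end_signalling() is a
	 * dma-fence critical section: lockdep complains about
	 * GFP_KERNEL allocations and about taking any lock that is
	 * also held around a dma_fence_wait(), since both can
	 * deadlock against reclaim.
	 */
	cookie = dma_fence_begin_signalling();

	/* ... kick the hardware, no memory allocation allowed ... */

	dma_fence_signal(job->done_fence);
	dma_fence_end_signalling(cookie);
}

Anything that can hold up job->done_fence is then cross-checked by
lockdep against everything that reclaim (and hence shrinkers and mmu
notifiers) is allowed to do.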
>
> Correct, nothing regarding dma/ttm/gem should creep into the HMM code
> path.
>
> For HMM, what you want when handling a GPU fault is to do it without
> holding any GPU driver locks, so that the regular page fault handler
> code path can go back into the GPU driver (through the shrinker)
> without worrying about it.
>
> This is how nouveau does it:
>     - get event about page fault (might hold some GPU lock)
>     - walk the event buffer to get all faulting addresses
>       (might hold some GPU lock)
>
>     ! DROP ALL GPU/DRIVER LOCKS !
>
>     - try to coalesce faults together (adjacent addresses
>       trigger a fault for a single range)
>     - call into HMM/mmu notifier helpers to handle the fault
>     - take GPU lock (svmm->mutex for nouveau)
>     - from the fault result update the GPU page table
>
>     - tell the GPU it can resume (this can be done after releasing
>       the lock below; it is just up to the device driver folks
>       to decide when to do that)
>
>     - unlock GPU (svmm->mutex for nouveau)
>
> (see nouveau_svm.c, look for range_fault)
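To make that flow concrete, here is a rough (and untested) sketch of
the retry loop at its core; the gpu_* names and struct gpu_svmm are
made up, only hmm_range_fault() and the mmu_interval_* helpers are the
real kernel APIs:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

struct gpu_svmm {		/* made-up driver state */
	struct mutex mutex;	/* also taken by our invalidation callback */
};

/* Made-up placeholders for the driver-specific parts: */
void gpu_update_page_table(struct gpu_svmm *svmm, struct hmm_range *range);
void gpu_replay_faults(struct gpu_svmm *svmm);

static int gpu_service_fault(struct gpu_svmm *svmm, struct mm_struct *mm,
			     struct hmm_range *range)
{
	int ret;

	/*
	 * No driver locks are held here: hmm_range_fault() may have to
	 * fault pages in from disk, and the allocations for that may
	 * recurse into our own shrinker.
	 */
	while (true) {
		range->notifier_seq = mmu_interval_read_begin(range->notifier);
		mmap_read_lock(mm);
		ret = hmm_range_fault(range);
		mmap_read_unlock(mm);
		if (ret) {
			if (ret == -EBUSY)
				continue;	/* raced with an invalidation */
			return ret;
		}

		mutex_lock(&svmm->mutex);	/* the GPU lock, taken last */
		if (!mmu_interval_read_retry(range->notifier,
					     range->notifier_seq))
			break;			/* result is still valid */
		mutex_unlock(&svmm->mutex);
	}

	gpu_update_page_table(svmm, range);	/* CPU writes the PTEs */
	gpu_replay_faults(svmm);		/* tell the GPU to resume */
	mutex_unlock(&svmm->mutex);
	return 0;
}

The seq/retry dance only works because svmm->mutex is also the lock
the driver's mmu notifier invalidation callback takes: once we hold it
and mmu_interval_read_retry() says nothing changed, the fault result
cannot be invalidated under us.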
>
> A GPU page fault should never need to rely on the shrinker succeeding
> at reclaiming memory. The rationale here is that regular memory
> reclaim (which includes regular anonymous or mmap'ed file-backed
> memory) will be able to make forward progress even during a GPU page
> fault. This of course means that while servicing a GPU page fault you
> might free up memory that was also in use by the GPU and which will
> fault again, but the LRU list should make that very unlikely.
>
> Note that for regular memory reclaim to make forward progress, the
> GPU _must_ support asynchronous GPU page table updates, i.e. being
> able to unmap things from the GPU page table and flush the GPU TLB
> without having to wait for anything running on the GPU. This is where
> not all hardware might work. Only things populated through HMM need
> to be unmapped (dma/ttm/gem should not be considered unless they too
> can be unmapped without deadlock).
>
> For nouveau this is done through register writes (see
> gf100_vmm_invalidate and gp100_vmm_invalidate_pdb; the GPU page table
> itself is updated through the CPU).
>
> I think the issue for AMD here is that you rely on updating the GPU
> page table through the GPU command processor, and this will not work
> with HMM. Unless you have a CP channel that can always make forward
> progress no matter what. Or update the GPU page table with the CPU
> and flush the GPU TLB through some register write (iirc this should
> be doable on AMD, but I am forgetting things now).
>
> There is no need for preemption, even if it is a nice-to-have feature
> (lazy GPU design architects and hardware designers! ;)).
>
> Now if dma/ttm/gem GPU memory allocations use all the system memory
> then we are in trouble. This was the whole reason for limiting
> ourselves, once upon a time, to not using over 50% of memory with
> dma/ttm/gem. I do not have any good way to solve that. If, to make
> forward progress, the GPU needs to allocate more memory but it has
> already exhausted all memory through dma/ttm/gem, then there is just
> nothing we can do.
>
> I still believe we should limit dma/ttm/gem memory usage so that we
> never end up pinning all system memory in GPU realms.

Well, this is exactly what we're trying to do here. And to be able to
shrink dma/ttm/gem memory you eventually have to call dma_fence_wait,
which is where the entire trouble comes in. The other source of
trouble is buffer-based userptr, where we also call dma_fence_wait.
And of course through dependencies (even across gpus) this
dma_fence_wait could end up being blocked on running some (non-hmm)
context on the same gpu engine which is currently running your hmm
context and is stuck on some kind of fault.

Also, with the recent dma-fence annotation patch we already locked
down the rule that dma_fence_wait from memory reclaim paths is ok
(i915 already has no limit on system memory usage, and Christian
really wants to drop it for ttm drivers too).

> Hope this helps clarify requirements and expectations.

So ... what now? I see a few options:

- require preempt. Tough on the hardware.

- rework the entire memory handling. Not sure that's possible, since
  it would mean that HMM gpus and buffer-based memory management gpus
  cannot co-exist on the same system. Flag days like that across an
  entire subsystem don't work.

- gang scheduling, i.e. when a gpu is running an HMM context it must
  guarantee that there's not a single buffer-based dma-buf/ttm/gem
  context pending (and block any execbuf/cs ioctl before we submit a
  new one).

- .. something else?

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel