Re: [PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

On Thu, Dec 17, 2020 at 7:26 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Thu, Dec 17, 2020 at 7:19 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >
> > On Thu, Dec 17, 2020 at 7:09 PM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
> > >
> > > Adding a few folks on cc just to raise awareness and so that i
> > > can be corrected if i said anything wrong.
> > >
> > > On Thu, Dec 17, 2020 at 04:45:55PM +0100, Daniel Vetter wrote:
> > > > On Thu, Dec 17, 2020 at 4:36 PM Christian König
> > > > <christian.koenig@xxxxxxx> wrote:
> > > > > Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> > > > > > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > > > > > <christian.koenig@xxxxxxx> wrote:
> > > > > >> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> > > > > >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > > > > >>> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> > > > > >>>> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> > > > > >>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> > > > > >>>>> [SNIP]
> > > > > >>>>>> +
> > > > > >>>>>> +/* As long as pages are available make sure to release at least one */
> > > > > >>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> > > > > >>>>>> +                                      struct shrink_control *sc)
> > > > > >>>>>> +{
> > > > > >>>>>> +    struct ttm_operation_ctx ctx = {
> > > > > >>>>>> +            .no_wait_gpu = true
> > > > > >>>>> Iirc there's an eventual shrinker limit where reclaim gets
> > > > > >>>>> desperate. I think once we hit that, we should allow gpu waits. But
> > > > > >>>>> that level isn't passed to shrinkers for reasons, so maybe we should
> > > > > >>>>> have a second round that tries to more actively shrink objects if we
> > > > > >>>>> fell substantially short of what reclaim expected us to do?
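> > > > > >>>>>
> > > > > >>>>> Very rough sketch of the two-round idea, with a made-up
> > > > > >>>>> ttm_tt_shrink() helper (not actual TTM code):
> > > > > >>>>>
> > > > > >>>>> static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> > > > > >>>>>                                           struct shrink_control *sc)
> > > > > >>>>> {
> > > > > >>>>>         struct ttm_operation_ctx ctx = { .no_wait_gpu = true };
> > > > > >>>>>         unsigned long freed;
> > > > > >>>>>
> > > > > >>>>>         /* First round: only reclaim pages from idle objects. */
> > > > > >>>>>         freed = ttm_tt_shrink(&ctx, sc->nr_to_scan);
> > > > > >>>>>
> > > > > >>>>>         /* If we fell substantially short of what reclaim asked
> > > > > >>>>>          * for, do a second round that allows GPU waits. */
> > > > > >>>>>         if (freed < sc->nr_to_scan / 2) {
> > > > > >>>>>                 ctx.no_wait_gpu = false;
> > > > > >>>>>                 freed += ttm_tt_shrink(&ctx, sc->nr_to_scan - freed);
> > > > > >>>>>         }
> > > > > >>>>>         return freed;
> > > > > >>>>> }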
> > > > > >>>> I think we should try to avoid waiting for the GPU in the shrinker callback.
> > > > > >>>>
> > > > > >>>> When we get HMM we will have cases where the shrinker is called from
> > > > > >>>> there and we can't wait for the GPU then without causing deadlocks.
> > > > > >>> Uh, that doesn't work. Also, the current rules are that you are allowed
> > > > > >>> to call dma_fence_wait from shrinker callbacks, so that ship has sailed
> > > > > >>> already. This is because shrinkers are a less restrictive context than
> > > > > >>> mmu notifier invalidation, and we wait in there too.
> > > > > >>>
> > > > > >>> So if you can't wait in shrinkers, you also can't wait in mmu
> > > > > >>> notifiers (and also not in HMM, which is the same thing). Why do you
> > > > > >>> need this?
> > > > > >> The core concept of HMM is that pages are faulted in on demand and it is
> > > > > >> perfectly valid for one of those pages to be on disk.
> > > > > >>
> > > > > >> So when a page fault happens we might need to be able to allocate memory
> > > > > >> and fetch something from disk to handle that.
> > > > > >>
> > > > > >> When this memory allocation then in turn waits for the GPU which is
> > > > > >> running the HMM process we are pretty much busted.
> > > > > > Yeah you can't do that. That's the entire infinite fences discussion.
> > > > >
> > > > > Yes, exactly.
> > > > >
> > > > > > For HMM to work, we need to stop using dma_fence for userspace sync,
> > > > >
> > > > > I was considering separating that into a dma_fence and an hmm_fence.
> > > > > Or something like that.
> > > >
> > > > The trouble is that dma_fence in all its forms is uapi. And on gpus
> > > > without page fault support dma_fence_wait is still required in
> > > > allocation contexts. So creating a new kernel structure doesn't really
> > > > solve anything I think, it needs entirely new uapi completely decoupled
> > > > from memory management. The last time we did new uapi was probably
> > > > modifiers, and that's still not rolled out years later.
> > >
> > > With hmm there should not be any fences! You do not need them.
> > > If you feel you need them then you are doing something horribly
> > > wrong. See below for what HMM needs and what that means.
> > >
> > >
> > > > > > and you can only use the amdkfd style preempt fences. And preempting
> > > > > > while the pagefault is pending is, I thought, something we require.
> > > > >
> > > > > Yeah, problem is that most hardware can't do that :)
> > > > >
> > > > > Getting page faults to work is hard enough, preempting while waiting for
> > > > > a fault to return is not something which was anticipated :)
> > > >
> > > > Hm, last summer in a thread you said you'd blocked that because it
> > > > doesn't work. I agreed, page faults without preempt are rather tough
> > > > to make work.
> > > >
> > > > > > Iow, the HMM page fault handler must not be a dma-fence critical
> > > > > > section, i.e. it's not allowed to hold up any dma_fence, ever.
> > > > >
> > > > > What do you mean with that?
> > > >
> > > > dma_fence_begin/end_signalling() annotations essentially, i.e.
> > > > cross-release dependencies. Or the other way round: if you want to be
> > > > able to allocate memory you have to guarantee that you're never
> > > > holding up a dma_fence.
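> > > >
> > > > For reference, usage looks roughly like this (sketch, but
> > > > dma_fence_begin/end_signalling() are the real annotation functions):
> > > >
> > > > bool cookie = dma_fence_begin_signalling();
> > > >
> > > > /* dma_fence critical section: nothing in here may allocate with
> > > >  * GFP_KERNEL or otherwise wait on reclaim, since reclaim might be
> > > >  * waiting on the fence we signal below. */
> > > > dma_fence_signal(fence);
> > > >
> > > > dma_fence_end_signalling(cookie);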
> > >
> > > Correct, nothing regarding dma/ttm/gem should creep into the HMM code
> > > path.
> > >
> > >
> > > For HMM, what you want when handling a GPU fault is to do it without
> > > holding any GPU driver locks, so that the regular page fault handler
> > > code path can go back into the GPU driver (through the shrinker)
> > > without worrying about it.
> > >
> > > This is how nouveau does it:
> > >     - get event about page fault (might hold some GPU lock)
> > >     - walk the event buffer to get all faulting addresses
> > >       (might hold some GPU lock)
> > >
> > >     ! DROP ALL GPU/DRIVER LOCK !
> > >
> > >     - try to coalesce faults together (adjacent addresses
> > >       trigger a fault for a single range)
> > >     - call in HMM/mmu notifier helpers to handle the fault
> > >     - take GPU lock (svmm->mutex for nouveau)
> > >     - from the fault result update GPU page table
> > >
> > >     - Tell GPU it can resume (this can be done after releasing
> > >       the lock below, it is just up to the device driver folks
> > >       to decide when to do that).
> > >
> > >     - unlock GPU (svmm->mutex for nouveau)
> > >
> > > (see nouveau_svm.c look for range_fault)
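> > >
> > > Condensed into code, the flow is roughly the below (simplified
> > > sketch with invented helper names; the real thing is in
> > > nouveau_svm.c):
> > >
> > >     /* may hold GPU locks while draining the event buffer */
> > >     faults = gather_faults(event_buffer);
> > >
> > >     /* ! DROP ALL GPU/DRIVER LOCK ! */
> > >
> > >     coalesce_faults(faults, &range);
> > >     ret = hmm_range_fault(&range);  /* may fault pages in from disk */
> > >
> > >     mutex_lock(&svmm->mutex);
> > >     if (!ret && !mmu_interval_read_retry(&notifier, range.notifier_seq))
> > >             update_gpu_page_table(svmm, &range);
> > >     mutex_unlock(&svmm->mutex);
> > >
> > >     resume_faulted_context(gpu);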
> > >
> > > A GPU page fault should never need to rely on the shrinker to
> > > succeed in reclaiming memory. The rationale here is that regular
> > > memory reclaim (which includes regular anonymous or file-backed
> > > mmap memory) will be able to make forward progress even during a
> > > GPU page fault. This of course means that while servicing a GPU
> > > page fault you might free up memory that was also in use by the
> > > GPU and which will fault again, but the lru list should make
> > > that very unlikely.
> > >
> > >
> > > Note that for regular memory reclaim to make forward progress
> > > the GPU _must_ support asynchronous GPU page table updates, ie
> > > being able to unmap things from the GPU page table and flush
> > > the GPU TLB without having to wait for anything running on
> > > the GPU. This is where not all hardware might work. Only things
> > > populated through HMM need to be unmapped (dma/ttm/gem should
> > > not be considered unless they too can be unmapped without
> > > deadlock).
> > >
> > > For nouveau this is done through register writes (see
> > > gf100_vmm_invalidate and gp100_vmm_invalidate_pdb); the GPU
> > > page table itself is updated through the CPU.
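> > >
> > > Conceptually something like this (pseudo-code, register names
> > > invented, see the real gf100_vmm_invalidate for details):
> > >
> > >     /* plain CPU write, no GPU engine involved at any point */
> > >     clear_gpu_pte(gpu_page_table, addr);
> > >
> > >     /* kick the TLB flush through mmio and spin until done */
> > >     writel(VMM_INVALIDATE_TRIGGER, mmio + VMM_INVALIDATE);
> > >     while (!(readl(mmio + VMM_INVALIDATE) & VMM_INVALIDATE_DONE))
> > >             cpu_relax();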
> > >
> > > I think the issue for AMD here is that you rely on updating the
> > > GPU page table through the GPU command processor, and this will
> > > not work with HMM. Unless you have a CP channel that can always
> > > make forward progress no matter what. Or update the GPU page
> > > table with the CPU and flush the GPU TLB through some register
> > > write (iirc this should be doable on AMD but i am forgetting
> > > things now).
> > >
> > >
> > > There is no need for preemption, even if it is a nice-to-have
> > > feature (lazy GPU architects and hardware designers! ;)).
> > >
> > >
> > > Now if dma/ttm/gem GPU memory allocations use all the system
> > > memory then we are in trouble. This was the whole reason for
> > > limiting ourselves, once upon a time, to not using over 50% of
> > > memory for dma/ttm/gem. I do not have any good way to solve
> > > that. If to make forward progress the GPU needs to allocate
> > > more memory but it has already exhausted all memory through
> > > dma/ttm/gem then there is just nothing we can do.
> > >
> > > I still believe we should limit dma/ttm/gem memory usage so
> > > that we never end up pinning all system memory in GPU realms.
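> > >
> > > Even something as dumb as a global page counter would do, ie
> > > (sketch, all names made up):
> > >
> > >     static atomic_long_t ttm_pages_in_use;
> > >     static unsigned long ttm_pages_limit;  /* eg totalram_pages() / 2 */
> > >
> > >     int ttm_pages_charge(unsigned long num_pages)
> > >     {
> > >             if (atomic_long_add_return(num_pages, &ttm_pages_in_use) >
> > >                 ttm_pages_limit) {
> > >                     atomic_long_sub(num_pages, &ttm_pages_in_use);
> > >                     return -ENOMEM;  /* or try to shrink, then retry */
> > >             }
> > >             return 0;
> > >     }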
> >
> > Well this is exactly what we're trying to do here. And to be able to
> > shrink dma/ttm/gem memory you eventually have to call dma_fence_wait,
> > which is where the entire trouble comes in.
> >
> > The other source of trouble is buffer-based userptr, where we also
> > call dma_fence_wait.
> >
> > And of course through dependencies (even across gpus) this
> > dma_fence_wait could end up being blocked on running some (non-hmm)
> > context on the same gpu engine which is currently running your hmm
> > context and is stuck on some kind of fault.
> >
> > Also, with the recent dma-fence annotation patch we already locked
> > down the rule that dma_fence_wait from memory reclaim paths is ok
> > (i915 already has no limit on system memory usage, and Christian
> > really wants to drop it for ttm drivers too).
> >
> > > Hope this helps clarify requirements and expectations.
> >
> > So ... what now?
> >
> > I see a few options:
> > - require preempt. Tough on the hardware.
> > - rework the entire memory handling. Not sure that's possible, since
> > it would mean that HMM gpus and buffer-based memory management gpus
> > cannot co-exist on the same system. Flag days like that across an
> > entire subsystem don't work.
> > - gang scheduling, i.e. when a gpu engine is running an HMM context it
> > must guarantee that there's not a single buffer-based dma-buf/ttm/gem
> > context pending (and block any execbuf/cs ioctl before we submit a new
> > one)
>
> This here wouldn't be the worst solution, even on desktops:
> - we already stall at execbuf for timeline semaphores, so this isn't even new
> - when the gpu is occupied by an hmm task, the execbuf/cs ioctl can
> issue a low-priority preempt request to make sure the desktop doesn't
> freeze forever. As long as that happens before we have created a
> dma_fence for that cs, it's all fine to preempt the HMM context.
> - preempted HMM tasks would immediately try to re-run, to make sure
> that they get fair scheduling between the execbuf tasks
>
> It's not going to be great since the pipeline goes down the gutter, but
> no frozen desktop while running HMM jobs. And as long as you spend the
> majority of the time in only one or the other, the gpu runs at full
> speed with all the pipelining. So no perf impact on gaming or on
> overnight compute jobs.
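>
> As pseudo-code the submission gate would be roughly the below (all
> names invented, just to illustrate the ordering):
>
>     int engine_submit(struct engine *e, struct job *job)
>     {
>             mutex_lock(&e->gang_lock);
>             if (job->is_hmm && e->active_dma_fence_jobs)
>                     wait_for_dma_fence_jobs(e);   /* drain first */
>             else if (!job->is_hmm && e->active_hmm_jobs)
>                     preempt_hmm_jobs(e);          /* low-prio preempt */
>             /* only now is it safe to create the job's dma_fence */
>             queue_job(e, job);
>             mutex_unlock(&e->gang_lock);
>             return 0;
>     }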

Also, since this gang scheduling mode between hmm and non-hmm jobs is
per-engine, it also solves the problem with using the copy engines to
move vram or write pagetables: As long as the kernel guarantees to
never run an hmm task on such an engine (simplest to do by outright
reserving one for the kernel exclusively) you won't deadlock between a
dma_fence ctx and an hmm ctx biting each other. That way we don't have
to rewrite the entire vram management for an eventual ZONE_DEVICE
world. I think at least.
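
Sketched out (invented names, just to illustrate the split):

    /* The kernel copy engine never runs hmm jobs, so vram moves and
     * page table updates can always make forward progress. */
    struct engine *pick_engine(struct device *dev, struct job *job)
    {
            if (job->is_kernel_job)
                    return dev->kernel_copy_engine;
            return least_busy_engine(dev, job->is_hmm);
    }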
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



