On 2021-02-09 at 9:08 a.m., Daniel Vetter wrote:
> On Tue, Feb 9, 2021 at 12:15 PM Felix Kuehling <felix.kuehling@xxxxxxx> wrote:
>> On 2021-02-09 at 1:37 a.m., Daniel Vetter wrote:
>>> On Tue, Feb 9, 2021 at 4:13 AM Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> On Thu, Jan 28, 2021 at 4:40 PM Felix Kuehling <felix.kuehling@xxxxxxx> wrote:
>>>>> On 2021-01-28 at 2:39 a.m., Christian König wrote:
>>>>>> On 27.01.21 at 23:00, Felix Kuehling wrote:
>>>>>>> On 2021-01-27 at 7:16 a.m., Christian König wrote:
>>>>>>>> On 27.01.21 at 13:11, Maarten Lankhorst wrote:
>>>>>>>>> On 27-01-2021 at 01:22, Felix Kuehling wrote:
>>>>>>>>>> On 2021-01-21 at 2:40 p.m., Daniel Vetter wrote:
>>>>>>>>>>> Recently there was a fairly long thread about recoverable hardware page
>>>>>>>>>>> faults, how they can deadlock, and what to do about that.
>>>>>>>>>>>
>>>>>>>>>>> While the discussion is still fresh I figured it's a good time to try and
>>>>>>>>>>> document the conclusions a bit.
>>>>>>>>>>>
>>>>>>>>>>> References: https://lore.kernel.org/dri-devel/20210107030127.20393-1-Felix.Kuehling@xxxxxxx/
>>>>>>>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx>
>>>>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxx>
>>>>>>>>>>> Cc: "Christian König" <christian.koenig@xxxxxxx>
>>>>>>>>>>> Cc: Jerome Glisse <jglisse@xxxxxxxxxx>
>>>>>>>>>>> Cc: Felix Kuehling <felix.kuehling@xxxxxxx>
>>>>>>>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxxx>
>>>>>>>>>>> Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx>
>>>>>>>>>>> Cc: linux-media@xxxxxxxxxxxxxxx
>>>>>>>>>>> Cc: linaro-mm-sig@xxxxxxxxxxxxxxxx
>>>>>>>>>>> --
>>>>>>>>>>> I'll be away next week, but figured I'll type this up quickly for some
>>>>>>>>>>> comments and to check whether I got this all roughly right.
>>>>>>>>>>>
>>>>>>>>>>> Critique very much wanted on this, so that we can make sure hw which
>>>>>>>>>>> can't preempt (with pagefaults pending) like gfx10 has a clear path to
>>>>>>>>>>> support page faults in upstream. So anything I missed, got wrong or
>>>>>>>>>>> like that would be good.
>>>>>>>>>>> -Daniel
>>>>>>>>>>> ---
>>>>>>>>>>>  Documentation/driver-api/dma-buf.rst | 66 ++++++++++++++++++++++++++++
>>>>>>>>>>>  1 file changed, 66 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
>>>>>>>>>>> index a2133d69872c..e924c1e4f7a3 100644
>>>>>>>>>>> --- a/Documentation/driver-api/dma-buf.rst
>>>>>>>>>>> +++ b/Documentation/driver-api/dma-buf.rst
>>>>>>>>>>> @@ -257,3 +257,69 @@ fences in the kernel. This means:
>>>>>>>>>>>    userspace is allowed to use userspace fencing or long running compute
>>>>>>>>>>>    workloads. This also means no implicit fencing for shared buffers in these
>>>>>>>>>>>    cases.
>>>>>>>>>>> +
>>>>>>>>>>> +Recoverable Hardware Page Faults Implications
>>>>>>>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>>>>>> +
>>>>>>>>>>> +Modern hardware supports recoverable page faults, which has a lot of
>>>>>>>>>>> +implications for DMA fences.
>>>>>>>>>>> +
>>>>>>>>>>> +First, a pending page fault obviously holds up the work that's running on the
>>>>>>>>>>> +accelerator and a memory allocation is usually required to resolve the fault.
>>>>>>>>>>> +But memory allocations are not allowed to gate completion of DMA fences, which
>>>>>>>>>>> +means any workload using recoverable page faults cannot use DMA fences for
>>>>>>>>>>> +synchronization. Synchronization fences controlled by userspace must be used
>>>>>>>>>>> +instead.
>>>>>>>>>>> +
>>>>>>>>>>> +On GPUs this poses a problem, because current desktop compositor protocols on
>>>>>>>>>>> +Linux rely on DMA fences, which means without an entirely new userspace stack
>>>>>>>>>>> +built on top of userspace fences, they cannot benefit from recoverable page
>>>>>>>>>>> +faults. The exception is when page faults are only used as migration hints and
>>>>>>>>>>> +never to on-demand fill a memory request. For now this means recoverable page
>>>>>>>>>>> +faults on GPUs are limited to pure compute workloads.
>>>>>>>>>>> +
>>>>>>>>>>> +Furthermore GPUs usually have shared resources between the 3D rendering and
>>>>>>>>>>> +compute side, like compute units or command submission engines. If both a 3D
>>>>>>>>>>> +job with a DMA fence and a compute workload using recoverable page faults are
>>>>>>>>>>> +pending they could deadlock:
>>>>>>>>>>> +
>>>>>>>>>>> +- The 3D workload might need to wait for the compute job to finish and release
>>>>>>>>>>> +  hardware resources first.
>>>>>>>>>>> +
>>>>>>>>>>> +- The compute workload might be stuck in a page fault, because the memory
>>>>>>>>>>> +  allocation is waiting for the DMA fence of the 3D workload to complete.
>>>>>>>>>>> +
>>>>>>>>>>> +There are a few ways to prevent this problem:
>>>>>>>>>>> +
>>>>>>>>>>> +- Compute workloads can always be preempted, even when a page fault is pending
>>>>>>>>>>> +  and not yet repaired. Not all hardware supports this.
>>>>>>>>>>> +
>>>>>>>>>>> +- DMA fence workloads and workloads which need page fault handling have
>>>>>>>>>>> +  independent hardware resources to guarantee forward progress. This could be
>>>>>>>>>>> +  achieved e.g. through dedicated engines and minimal compute unit reservations
>>>>>>>>>>> +  for DMA fence workloads.
>>>>>>>>>>> +
>>>>>>>>>>> +- The reservation approach could be further refined by only reserving the
>>>>>>>>>>> +  hardware resources for DMA fence workloads when they are in-flight. This must
>>>>>>>>>>> +  cover the time from when the DMA fence is visible to other threads up to the
>>>>>>>>>>> +  moment when the fence is completed through dma_fence_signal().
>>>>>>>>>>> +
>>>>>>>>>>> +- As a last resort, if the hardware provides no useful reservation mechanics,
>>>>>>>>>>> +  all workloads must be flushed from the GPU when switching between jobs
>>>>>>>>>>> +  requiring DMA fences or jobs requiring page fault handling: This means all DMA
>>>>>>>>>>> +  fences must complete before a compute job with page fault handling can be
>>>>>>>>>>> +  inserted into the scheduler queue. And vice versa, before a DMA fence can be
>>>>>>>>>>> +  made visible anywhere in the system, all compute workloads must be preempted
>>>>>>>>>>> +  to guarantee all pending GPU page faults are flushed.
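To make the last-resort option above a bit more concrete, the serialization it
asks for could look roughly like the sketch below. All of the helper names
(wait_for_all_dma_fence_jobs(), preempt_all_page_fault_jobs(), struct my_gpu)
are made up for illustration and are not existing driver or DRM API; the only
point is the ordering they enforce.

/*
 * Hypothetical sketch of the "flush everything when switching modes"
 * fallback. A DMA fence must never become visible while page-faulting
 * work could still hold shared hardware resources, and vice versa.
 */
static DEFINE_MUTEX(gpu_mode_lock);
static bool gpu_in_page_fault_mode;

/* Call before queuing a job that relies on recoverable page faults. */
static int enter_page_fault_mode(struct my_gpu *gpu)
{
	int r = 0;

	mutex_lock(&gpu_mode_lock);
	if (!gpu_in_page_fault_mode) {
		/* All previously published DMA fences must signal first. */
		r = wait_for_all_dma_fence_jobs(gpu);
		if (!r)
			gpu_in_page_fault_mode = true;
	}
	mutex_unlock(&gpu_mode_lock);
	return r;
}

/* Call before a new DMA fence becomes visible to other threads. */
static int enter_dma_fence_mode(struct my_gpu *gpu)
{
	int r = 0;

	mutex_lock(&gpu_mode_lock);
	if (gpu_in_page_fault_mode) {
		/* Preempt compute so no pending page fault can stall the fence. */
		r = preempt_all_page_fault_jobs(gpu);
		if (!r)
			gpu_in_page_fault_mode = false;
	}
	mutex_unlock(&gpu_mode_lock);
	return r;
}

Note that waiting happens before the new fence is published, so the waits
themselves never end up inside a fence-signalling path.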
>>>>>>>>>> I thought of another possible workaround:
>>>>>>>>>>
>>>>>>>>>>   * Partition the memory. Servicing of page faults will use a separate
>>>>>>>>>>     memory pool that can always be allocated from without waiting for
>>>>>>>>>>     fences. This includes memory for page tables and memory for
>>>>>>>>>>     migrating data to. You may steal memory from other processes that
>>>>>>>>>>     can page fault, so no fence waiting is necessary. Being able to
>>>>>>>>>>     steal memory at any time also means there are basically no
>>>>>>>>>>     out-of-memory situations you need to worry about. Even page tables
>>>>>>>>>>     (except the root page directory of each process) can be stolen in
>>>>>>>>>>     the worst case.
>>>>>>>>> I think 'overcommit' would be a nice way to describe this. But I'm not
>>>>>>>>> sure how easy this is to implement in practice. You would basically need
>>>>>>>>> to create your own memory manager for this.
>>>>>>>> Well you would need a completely separate pool for both device as well
>>>>>>>> as system memory.
>>>>>>>>
>>>>>>>> E.g. on boot we say we steal X GB system memory only for HMM.
>>>>>>> Why? The GPU driver doesn't need to allocate system memory for HMM.
>>>>>>> Migrations to system memory are handled by the kernel's handle_mm_fault
>>>>>>> and page allocator and swap logic.
>>>>>> And that one depends on dma_fence completion because you can easily
>>>>>> need to wait for an MMU notifier callback.
>>>>> I see, the GFX MMU notifier for userpointers in amdgpu currently waits
>>>>> for fences. For the KFD MMU notifier I am planning to fix this by
>>>>> causing GPU page faults instead of preempting the queues. Can we limit
>>>>> userptrs in amdgpu to engines that can page fault? Basically make it
>>>>> illegal to attach userptr BOs to graphics CS BO lists, so they can only
>>>>> be used in user mode command submissions, which can page fault. Then the
>>>>> GFX MMU notifier could invalidate PTEs and would not have to wait for
>>>>> fences.
>>>> Sadly graphics + userptr is already exposed via Mesa.
>>> This is not about userptr, we fake userptr entirely in software. It's
>>> about exposing recoverable gpu page faults (which would make userptr
>>> maybe more efficient since we could do on-demand paging). userptr
>>> itself isn't a problem, but it is part of the reasons why this is
>>> tricky.
>>>
>>> Christian/Felix, I think for kernel folks this is clear enough that I
>>> don't need to clarify this in the text?
>> Yeah, it's clear to me. Anyway, your latest text doesn't reference
>> userptr directly and keeps the discussion at a fairly abstract level. So
>> I think it's fine. It's the practical details of the proposed
>> workarounds where it feels like walking through a mirror cabinet,
>> bumping into unexpected obstacles with every other step.
> Oh yes, this is very high-level. The implementation is going to be
> very tricky, no matter which one we're picking. And tbh I expect
> surprises and things we'll learn. But I'm still hoping that this
> high-level doc patch will help a lot with avoiding the worst problems.
>
> Of course once we have some of these hacks landed we should look at it
> again and maybe update where it's wrong/unclear/...
>
> btw r-b: from you too on the patch?

Yes. Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>

Thanks,
  Felix

> Cheers, Daniel
>
>> Regards,
>>   Felix
>>
>>
>>> -Daniel
>>>
>>>>>> As Maarten wrote, when you want to go down this route you need a
>>>>>> completely separate memory management parallel to the one of the kernel.
>>>>> Not really. I'm trying to make the GPU memory management more similar to
>>>>> what the kernel does for system memory.
>>>>>
>>>>> I understood Maarten's comment as "I'm creating a new memory manager and
>>>>> not using TTM any more". This is true. The idea is that this portion of
>>>>> VRAM would be managed more like system memory.
>>>>>
>>>>> Regards,
>>>>>   Felix
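As an aside on the MMU notifier point above (and the point, quoted further
down, that invalidating device page tables in the MMU notifier doesn't need to
wait for any fences): that kind of fence-free invalidation maps fairly
directly onto the mmu_interval_notifier API. A minimal sketch, where struct
my_gpu_svm_range and my_gpu_invalidate_ptes() are made-up placeholders rather
than actual amdgpu code:

static bool my_gpu_mni_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
{
	struct my_gpu_svm_range *svm =
		container_of(mni, struct my_gpu_svm_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	/* Bump the sequence number so concurrent faults retry and re-validate. */
	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * Only invalidate the device PTEs for this range. The next GPU access
	 * raises a recoverable page fault and re-validates, so there is no
	 * dma_fence wait anywhere in this path.
	 */
	my_gpu_invalidate_ptes(svm, range->start, range->end);

	return true;
}

static const struct mmu_interval_notifier_ops my_gpu_mni_ops = {
	.invalidate = my_gpu_mni_invalidate,
};

The interesting property is that the callback touches only driver page-table
state; any dma_fence wait in here would reintroduce exactly the dependency
Christian points out above.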
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> It doesn't depend on any fences, so
>>>>>>> it cannot deadlock with any GPU driver-managed memory. The GPU driver
>>>>>>> gets involved in the MMU notifier to invalidate device page tables. But
>>>>>>> that also doesn't need to wait for any fences.
>>>>>>>
>>>>>>> And if the kernel runs out of pageable memory, you're in trouble anyway.
>>>>>>> The OOM killer will step in, nothing new there.
>>>>>>>
>>>>>>> Regards,
>>>>>>>   Felix
>>>>>>>
>>>>>>>>> But from a design point of view, definitely a valid solution.
>>>>>>>> I think the restriction above makes it pretty much unusable.
>>>>>>>>
>>>>>>>>> But this looks good, those solutions are definitely the valid
>>>>>>>>> options we can choose from.
>>>>>>>> It's certainly worth noting, yes. And just to make sure that nobody
>>>>>>>> has the idea to reserve only device memory.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> ~Maarten
>>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Linaro-mm-sig mailing list
>>>>>>> Linaro-mm-sig@xxxxxxxxxxxxxxxx
>>>>>>> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
>>>>>>>
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@xxxxxxxxxxxxxxxxxxxxx
>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>
>