The primary goal of these patches is to introduce what I've started calling
"prelocations" on Broadwell. A prelocation is like a relocation, except not.
When a GPU client specifies a prelocation, it is instructing the kernel where
in the GPU address space the buffer should be mapped. The mechanism works very
similarly to a relocation, except it uses the execbuffer object to obtain the
offset, and binds if needed. If a GPU client uses only prelocations, the
relocation process can be skipped entirely.

This sounds like a big win initially, but realistically with full PPGTT and a
48b address space it's unlikely to noticeably improve anything. Doing this work
leaves the address space allocation up to libc/malloc [1] instead of drm_mm,
which I believe has some upside due to the hits on creating new VMAs. Not
specific to prelocations, dynamic page table allocations by themselves can save
measurable memory on systems running multiple GPU clients. As previously
mentioned, this kind of thing is needed for OCL 2.0 SVM. One other advantage
I've discussed with Ken... [2].

The difficult part of enabling this [for 64b platforms] is supporting the 48b
address space. As mentioned in previous versions of this cover letter, and my
blog post [3], it's not feasible to allocate the entire 48b address space's
page tables. Dynamic page table allocation and teardown required a lot of
plumbing and rework, and to make the interfaces as neat as possible, I also had
to put a good deal of work into GEN7 PPGTT as well. The other really difficult
part is taking the malloc'd memory and turning it into GPU usable pages.
Luckily, Chris already did that for me with userptr, so I simply reused his
work.

The kernel patches are lightly tested at best. Previous iterations of this
series were more thoroughly tested, but enough has changed since then that I
would assume the code is unstable.
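
To put the point above about the 48b address space's page tables in rough
numbers, here is a back-of-the-envelope sketch. It assumes a 4-level layout with
4 KiB pages and 512 eight-byte entries per table (9 address bits per level).
Those constants and both function names are illustrative assumptions, not
something taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, not taken from the patches: 4 KiB pages and
 * 512 eight-byte entries per table (9 address bits per level). */
#define PAGE_SHIFT_ASSUMED	12
#define BITS_PER_LEVEL		9
#define ENTRY_BYTES		8ULL

/* Number of tables needed at one level to fully map 2^va_bits bytes.
 * level 0 = page tables, 1 = page directories, 2 = PDPs, 3 = PML4. */
uint64_t tables_at_level(unsigned va_bits, unsigned level)
{
	unsigned covered = PAGE_SHIFT_ASSUMED + BITS_PER_LEVEL * (level + 1);

	assert(va_bits <= 64);
	/* One table spans 2^covered bytes; a smaller space still needs one. */
	return covered >= va_bits ? 1 : 1ULL << (va_bits - covered);
}

/* Bytes of page table memory if every level is fully populated. */
uint64_t full_table_bytes(unsigned va_bits, unsigned levels)
{
	uint64_t table_bytes = ENTRY_BYTES << BITS_PER_LEVEL; /* 4 KiB each */
	uint64_t total = 0;

	for (unsigned level = 0; level < levels; level++)
		total += tables_at_level(va_bits, level) * table_bytes;
	return total;
}
```

Under these assumptions, full_table_bytes(48, 4) comes to a little over
512 GiB, almost all of it last-level page tables, whereas mapping a single page
on demand needs only one 4 KiB table per level (16 KiB in total).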

If miraculously it is almost stable, there are still a lot of cosmetic things
to clean up, and a performance optimization to reduce re-mapping already mapped
objects. I started on a patch to do this but ran into too many stability
problems (see "Optimize PDP loads" from previous posts). It's likely that
memory leaks are introduced with the dynamic page tables; plugging those would
be nice. One could also implement the reaper I refer to in the comments.

With the kernel prelocation support are the libdrm patches, an intel-gpu-tools
test, and a mesa patch. Some parts of the code are in rough shape, and were
meant for demonstration only. The userspace components in particular were
mostly meant as sample code. [4]

The series is fundamentally 5 parts with some bleeding between 2-3, and 3-4.
1. [00-18] Provide fixes to make a stable branch for test with full PPGTT.
   I've previously posted this as a separate series. In the meantime, many
   similar fixes have gone in, and some of these may be dropped. So this is
   mostly here for completeness.
2. [19-42] Rework code to avoid as much future churn as possible. Nothing
   special here. Some of this is arguably #3.
3. [43-46] Make page table allocations dynamic. I tried to keep this generic,
   but since the current code supported very specific page table depths, it's
   really mostly GEN7.
4. [47-67] GEN8 dynamic page table support with 64b page table support. This
   was very hard to split up, and is definitely the majority of the work.
5. [68] A basic SVM interface. I opted not to use the create2 IOCTL since there
   are patches for that already, and I wanted to have something that's as
   reusable as possible.
X. the rest are workaround/libdrm/mesa/igt

Kernel: http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=prelocate
libdrm: http://cgit.freedesktop.org/~bwidawsk/drm/log/?h=prelocate
mesa:   http://cgit.freedesktop.org/~bwidawsk/mesa/log/?h=prelocate
IGT:    http://cgit.freedesktop.org/~bwidawsk/intel-gpu-tools/log/?h=prelocate

Final thoughts:
* Due to time pressure, the ability to go back and test on GEN7 was lost. The
  original patches I posted back in March did work fine on GEN7, but I cannot
  speak to the quality now. That said, I did the work, so I figured I may as
  well provide it. For the sake of progress, someone should test/fix GEN7, or
  simply drop the GEN7 support.
* Broadwell is currently hanging with this patch series when I run piglit. I
  have gone through plenty of software bugs, and this current hang is baffling.
  Therefore I think it makes sense to either parameterize dynamic page table
  allocations, or put them behind a CONFIG_ option, until that's solved.
* Again on the stability, there are a lot of extra flushes introduced as a
  result of this series. I believe if we can figure out the cause of some of
  these issues, we can remove some flushes.
* I haven't tested aliasing PPGTT only in a while. Someone should do that.
* I'll bet 32b is broken.
* A lot of issues I had were related to the complexities when dealing with
  legacy contexts. It's possible, and I am hopeful, that with execlists these
  issues go away, and so do the hangs.
* The patches have been rebased SOOOOO many times that they really need to be
  reviewed closely to make sure they're bisectable. They were at one time, but
  I doubt it's the case now.

[1] We have to use mmap in certain situations due to a hardware limitation.
    I'm not sure how libc manages these things together. I hope it's
    efficient...
[2] We can potentially always set the state base to be 0, and rely on HW
    contexts to save/restore this information, thus eliminating this
    non-pipelined state upload.
    It turns out this is not possible for all cases because of hardware
    limitations, but it's a neat idea that someone can possibly turn into
    something useful. It's also probably a premature optimization given how
    many PIPE CONTROL stalls we have.
[3] https://bwidawsk.net/blog/index.php/2014/07/future-ppgtt-part-4-dynamic-page-table-allocations-64-bit-address-space-gpu-mirroring-and-yeah-something-about-relocs-too/
[4] This was the best I could do on short notice. I won't be improving,
    rebasing, or fixing these patches any longer, but someone is welcome to
    take them over. Consider this my parting gift before I go on sabbatical
    [tomorrow].

--
Ben Widawsky (68):
  drm/i915: Split up do_switch
  drm/i915: Extract l3 remapping out of ctx switch
  drm/i915/ppgtt: Load address space after mi_set_context
  drm/i915: Fix another another use-after-free in do_switch
  drm/i915/ctx: Return earlier on failure
  drm/i915/error: vma error capture prettyify
  drm/i915/error: Do a better job of disambiguating VMAs
  drm/i915/error: Capture vmas instead of BOs
  drm/i915: Add some extra guards in evict_vm
  drm/i915: Make an uninterruptible evict
  drm/i915: More correct (slower) ppgtt cleanup
  drm/i915: Defer PPGTT cleanup
  drm/i915/bdw: Enable full PPGTT
  drm/i915: Get the error state over the wire (HACKish)
  drm/i915/gen8: Invalidate TLBs before PDP reload
  drm/i915: Remove false assertion in ppgtt_release
  Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw"
  drm/i915/trace: Fix offsets for 64b
  drm/i915: Wrap VMA binding
  drm/i915: Make pin global flags explicit
  drm/i915: Split out aliasing binds
  drm/i915: fix gtt_total_entries()
  drm/i915: Rename to GEN8_LEGACY_PDPES
  drm/i915: Split out verbose PPGTT dumping
  drm/i915: s/pd/pdpe, s/pt/pde
  drm/i915: rename map/unmap to dma_map/unmap
  drm/i915: Setup less PPGTT on failed pagedir
  drm/i915: clean up PPGTT init error path
  drm/i915: Un-hardcode number of page directories
  drm/i915: Make gen6_write_pdes gen6_map_page_tables
  drm/i915: Range clearing is PPGTT agnostic
  drm/i915: Page table helpers, and define renames
  drm/i915: construct page table abstractions
  drm/i915: Complete page table structures
  drm/i915: Create page table allocators
  drm/i915: Generalize GEN6 mapping
  drm/i915: Clean up pagetable DMA map & unmap
  drm/i915: Always dma map page table allocations
  drm/i915: Consolidate dma mappings
  drm/i915: Always dma map page directory allocations
  drm/i915: Track GEN6 page table usage
  drm/i915: Extract context switch skip logic
  drm/i915: Track page table reload need
  drm/i915: Initialize all contexts
  drm/i915: Finish gen6/7 dynamic page table allocation
  drm/i915/bdw: Use dynamic allocation idioms on free
  drm/i915/bdw: pagedirs rework allocation
  drm/i915/bdw: pagetable allocation rework
  drm/i915/bdw: Make the pdp switch a bit less hacky
  drm/i915: num_pd_pages/num_pd_entries isn't useful
  drm/i915: Extract PPGTT param from pagedir alloc
  drm/i915/bdw: Split out mappings
  drm/i915/bdw: begin bitmap tracking
  drm/i915/bdw: Dynamic page table allocations
  drm/i915/bdw: Make pdp allocation more dynamic
  drm/i915/bdw: Abstract PDP usage
  drm/i915/bdw: Add dynamic page trace events
  drm/i915/bdw: Add ppgtt info for dynamic pages
  drm/i915/bdw: implement alloc/teardown for 4lvl
  drm/i915/bdw: Add 4 level switching infrastructure
  drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT
  drm/i915: Plumb sg_iter through va allocation ->maps
  drm/i915: Introduce map and unmap for VMAs
  drm/i915: Depend exclusively on map and unmap_vma
  drm/i915: Expand error state's address width to 64b
  drm/i915/bdw: Flip the 48b switch
  drm/i915: Provide a soft_pin hook
  XXX: drm/i915: Unexplained workarounds

 drivers/gpu/drm/i915/i915_debugfs.c        |  114 +-
 drivers/gpu/drm/i915/i915_drv.h            |   61 +-
 drivers/gpu/drm/i915/i915_gem.c            |  231 +++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  276 ++++-
 drivers/gpu/drm/i915/i915_gem_evict.c      |   39 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   27 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 1838 +++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  379 +++++-
 drivers/gpu/drm/i915/i915_gem_stolen.c     |    2 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c    |    7 +-
 drivers/gpu/drm/i915/i915_gpu_error.c      |  171 ++-
 drivers/gpu/drm/i915/i915_reg.h            |    1 +
 drivers/gpu/drm/i915/i915_sysfs.c          |    2 +-
 drivers/gpu/drm/i915/i915_trace.h          |  156 ++-
 drivers/gpu/drm/i915/intel_pm.c            |   16 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 include/uapi/drm/i915_drm.h                |    3 +-
 17 files changed, 2588 insertions(+), 737 deletions(-)

--
2.0.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx