On Tue, Jan 20, 2015 at 1:46 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jan 20, 2015 at 12:53:35PM -0800, Kristian Høgsberg wrote:
>> On Tue, Jan 20, 2015 at 12:42 AM, Daniel Vetter <daniel@xxxxxxxx> wrote:
>> > My idea for all this would have been to create a per-thread execbuf
>> > relocation context with a hashtab to map buffer pointers to execbuf
>> > index, and a bunch of arrays to prepare the reloc entry tables. If you
>> > do it correctly, all the per-reloc work should be O(1): streaming
>> > writes to a few arrays plus the hashtab lookup, with no code run at
>> > execbuf time (except the ioctl ofc). Even the libdrm_bo->presumed_offset
>> > update after execbuf could be done lockless (as long as readers are
>> > careful to never reload it, by using something similar to the kernel's
>> > READ_ONCE macro).
>> >
>> > But that means a completely new reloc api, so a lot more work. Also I
>> > think it only makes sense to do that for drivers that really care
>> > about the last bit of performance, and then do it within the driver
>> > so that there are no constraints about abi.
>>
>> Indeed, I moved it into mesa so I could rework that. bo_emit_reloc()
>> is showing up in profiles. The patch below, along with the NO_RELOC and
>> HANDLE_LUT flags, gives me a 5-10% improvement on CPU-bound benchmarks,
>> so it's certainly worth it. I'm skeptical that a hashtable lookup per
>> reloc emit is going to perform better than just fixing up the relocs
>> at execbuf2 time, though. It would be nice to not do any work at ioctl
>> time, but for that you need a very fast way to map from bo to
>> per-thread bo state as you go. Maybe a per-thread array mapping from
>> gem handle to exec_object could work...
>>
>> WIP patch is here:
>>
>> http://cgit.freedesktop.org/~krh/mesa/commit/?h=b0e4ce7bbce2a79ad37d6de460af88b9581ea1d7
>
> Hmm, that is actually pretty neat.
> My idle thought was to create a
> per-context batchmgr with its own view of the bo (to counter the
> multithreaded free-for-all). In your patch, you neatly demonstrate that
> you don't need a per-context view of the bo, only of the relocations. And
> it will make drm_intel_bo_emit_reloc() fixed cost, which should produce
> most of your CPU overhead saving.
>
> However, I think if you do take it a step further with a batchmgr_bo,
> you can make drm_intel_bo_references() very cheap as well.

I did think about a per-thread/context bo wrapper that could track the
validation list index and presumed_offset, but the problem is that there's
only one intel_mipmap_tree/intel_buffer_object struct to store it in. We'd
have to add a per-thread wrapper for those too, and I'm not sure that's
feasible.

However, drm_intel_bo_reference/unreference() are showing up in the
profiles now that relocations are cheaper, and I think the better way to
make those cheaper is to move the ref count to the public struct and make
ref/unref inline functions (calling out to a destructor for the
refcount==0 case, of course).

On that note, do you know why drm_intel_gem_bo_unreference() is so
convoluted? What's the deal with the atomic_add_unless?

> Looks good.
> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
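[Editor's note: the per-thread relocation context Daniel describes above could be sketched roughly as below — one open-addressed hash table mapping bo pointers to execbuf object indices, so each reloc emit is a hash lookup plus a few streaming array writes. All names here (reloc_context, reloc_context_get_index) are illustrative, not the actual libdrm or mesa API.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-thread relocation context: maps a bo pointer to its
 * index in the execbuf object array via an open-addressed hash table,
 * making each emit O(1) with no work left for execbuf time. */

#define TABLE_SIZE 1024            /* power of two, > max objects per batch */

struct reloc_context {
    void *bo[TABLE_SIZE];          /* hash keys: bo pointers (NULL = empty) */
    int index[TABLE_SIZE];         /* values: execbuf object index */
    int next_index;                /* next free slot in the object array */
};

static uint32_t hash_ptr(const void *p)
{
    uintptr_t v = (uintptr_t) p;

    /* Drop the low alignment bits, then scramble with a Knuth-style
     * multiplicative hash and mask down to the table size. */
    return (uint32_t) (((v >> 4) * 2654435761u) & (TABLE_SIZE - 1));
}

/* Return the execbuf index for bo, assigning a fresh one on first use. */
static int reloc_context_get_index(struct reloc_context *ctx, void *bo)
{
    uint32_t slot = hash_ptr(bo);

    while (ctx->bo[slot] && ctx->bo[slot] != bo)
        slot = (slot + 1) & (TABLE_SIZE - 1);   /* linear probing */

    if (!ctx->bo[slot]) {
        ctx->bo[slot] = bo;                     /* first reloc to this bo */
        ctx->index[slot] = ctx->next_index++;
    }
    return ctx->index[slot];
}
```

Since the context is per-thread, no locking is needed on the table itself; only the presumed_offset readback after execbuf would need the READ_ONCE-style care mentioned above.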
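[Editor's note: the inline ref/unref scheme proposed above — refcount in the public struct, inline fast path, out-of-line destructor only for the refcount==0 case — might look something like this sketch using C11 atomics. The fake_bo type and function names are made up for illustration; this is not the real drm_intel_bo layout.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical bo with the refcount exposed in the public struct so the
 * common ref/unref paths can be inlined into the caller. */
struct fake_bo {
    atomic_int refcount;
    /* ... handle, size, presumed offset, etc. ... */
};

/* Out-of-line destructor: only reached when the count hits zero. */
static void fake_bo_free(struct fake_bo *bo)
{
    free(bo);
}

static inline void fake_bo_reference(struct fake_bo *bo)
{
    /* Taking a reference needs no ordering of its own. */
    atomic_fetch_add_explicit(&bo->refcount, 1, memory_order_relaxed);
}

static inline void fake_bo_unreference(struct fake_bo *bo)
{
    /* Release on the decrement plus an acquire fence before the free,
     * so every thread's writes to the bo happen-before its destruction. */
    if (atomic_fetch_sub_explicit(&bo->refcount, 1,
                                  memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        fake_bo_free(bo);
    }
}
```

With this shape the hot path is a single atomic add or sub in the caller, and only the final unreference pays for a function call.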