On Tue, Nov 04, 2014 at 08:35:00AM -0800, Volkin, Bradley D wrote:
> On Tue, Nov 04, 2014 at 02:17:59AM -0800, Daniel Vetter wrote:
> > On Mon, Nov 03, 2014 at 11:19:42AM -0800, bradley.d.volkin@xxxxxxxxx wrote:
> > 
> > Hm, no in-line clflush and cached cpu mmaps. How slow is this on vlv?
> 
> No system to test on. Maybe we can get some qa perf testing on all relevant
> platforms, since we'll want an overall sense of perf impact anyhow.
> 
> Is there a specific change you would make here to avoid the issue?

The fastest upload path afaik (without big setup costs like new pagetables)
is write + immediate clflushing. But that means that the cmd parsing and
write-out to the shadow batch need to be merged first.

The downside of your approach here of first setting things to the cpu write
domain is that we'll incur an additional clflush pass before we copy things,
which isn't needed since we'll overwrite all the cachelines anyway.

So an intermediate solution would be to leave the object in the GTT (flushed)
domain and only clflush after the copy and cmd parsing are done. But since
clflush is only required on vlv (LLC machines are coherent) you really need
such a box to test/develop this. So I guess we'll just wait with this.

> > > +	if (!ret) {
> > > +		ret = i915_gem_object_set_to_gtt_domain(shadow_batch_obj,
> > > +							false);
> > 
> > This smells like hacking around the domain tracking system. The batch is
> > executed by the gpu, gtt domain is for cpu access. This might also be a
> > reason why the shrinker wreaks havoc with your shadow patches. With the
> > correct pending_read_domains move_to_active should take care of things
> > mostly.
> 
> Ok, I'll look at this again. I was using the golden context batch buffer code
> as a reference for this, so perhaps there's something to fix there as well.

Hm, indeed that's a bit hacked up too. So let's document the sequence as it's
used by execbuf: i915_gem_execbuffer_move_to_gpu takes care of any cpu and
gpu side cache flushing.
For the cmd parser you should be able to reuse that just by putting the
shadow batch vma onto the execbuf list. render ctx init would need to send
in a single entry vma list. i915_gem_execbuffer_move_to_active updates the
domain tracking and puts the vma onto the active list. That requires
correctly adjusted pending_read_domains though.

With these two no set_to_gtt_domain call should be required for flushing
data out.

Mika, can you perhaps look into this, together with kerneldoc for the render
init functions?

> > > 
> > > +		if (ret)
> > > +			DRM_DEBUG_DRIVER("CMD: Failed to set batch GTT domain\n");
> > > +	}
> > > 
> > >  	return ret;
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > index 9a73533..dea0f45 100644
> > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > @@ -1891,6 +1891,7 @@ int i915_driver_unload(struct drm_device *dev)
> > > 
> > >  	mutex_lock(&dev->struct_mutex);
> > >  	i915_gem_cleanup_ringbuffer(dev);
> > > +	i915_gem_batch_pool_fini(&dev_priv->mm.batch_pool);
> > >  	i915_gem_context_fini(dev);
> > >  	mutex_unlock(&dev->struct_mutex);
> > >  	i915_gem_cleanup_stolen(dev);
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > > index fbf10cc..50d974d 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.h
> > > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > > @@ -1145,6 +1145,13 @@ struct i915_gem_mm {
> > >  	 */
> > >  	struct list_head unbound_list;
> > > 
> > > +	/*
> > > +	 * A pool of objects to use as shadow copies of client batch buffers
> > > +	 * when the command parser is enabled. Prevents the client from
> > > +	 * modifying the batch contents after software parsing.
> > > +	 */
> > > +	struct i915_gem_batch_pool batch_pool;
> > > +
> > >  	/** Usable portion of the GTT for GEM */
> > >  	unsigned long stolen_base; /* limited to low memory (32-bit) */
> > > 
> > > @@ -2782,6 +2789,7 @@ void i915_cmd_parser_fini_ring(struct intel_engine_cs *ring);
> > >  bool i915_needs_cmd_parser(struct intel_engine_cs *ring);
> > >  int i915_parse_cmds(struct intel_engine_cs *ring,
> > >  		    struct drm_i915_gem_object *batch_obj,
> > > +		    struct drm_i915_gem_object *shadow_batch_obj,
> > >  		    u32 batch_start_offset,
> > >  		    bool is_master);
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > index 4dbd7b9..c59100d 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > @@ -2500,6 +2500,14 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
> > >  	if (request->ctx)
> > >  		i915_gem_context_unreference(request->ctx);
> > > 
> > > +	if (request->batch_obj &&
> > > +	    !list_empty(&request->batch_obj->batch_pool_list)) {
> > > +		struct drm_i915_private *dev_priv = to_i915(request->ring->dev);
> > > +
> > > +		i915_gem_batch_pool_put(&dev_priv->mm.batch_pool,
> > > +					request->batch_obj);
> > > +	}
> > 
> > This looks a bit like reinvented active tracking. If we copy the libdrm
> > cache design a bit more the cache would simply walk the active list when
> > there's nothing in the inactive list and shovel all the objects with
> > ->active.
> > 
> > The advantage would be that we wouldn't leak special batch_pool_put code
> > all over the place, keeping things nice&tidy.
> 
> Chris had also suggested looking at the libdrm cache with respect to this,
> but I was concerned about the impact of the scheduler reordering requests.
> I'd have to go back and look at libdrm, but it sounded like it was relying
> on the fifo nature of submissions. Could be wrong though.

It just walks the list and moves all the non-active stuff over.
Not terribly efficient in the face of a scheduler (or multiple rings even),
but I think we could start to care about this if it turns out to be a real
problem.

> Plus, we're hardly leaking batch_pool_put "all over the place" :). There's
> exactly 3 call sites - this one for batches that execute and complete,
> one for the chained batch fallback, and one in the execbuf error handling
> path. So my inclination is really to leave it as is.

Oh I know, everyone's just adding 3 callsites over the driver ;-) GEM is
already way too complex for me to fully understand all the corners, so I
tend to be fairly nasty with separation and encapsulation. E.g. I totally
agree with Chris that all the execlist special cases we're sprinkling all
over the driver will come back and bite us hard. Especially since they seem
to be growing still with recent patches.

> > >  		/*
> > > -		 * XXX: Actually do this when enabling batch copy...
> > > +		 * Set the DISPATCH_SECURE bit to remove the NON_SECURE
> > > +		 * bit from MI_BATCH_BUFFER_START commands issued in the
> > > +		 * dispatch_execbuffer implementations. We specifically
> > > +		 * don't want that set when the command parser is
> > > +		 * enabled.
> > >  		 *
> > > -		 * Set the DISPATCH_SECURE bit to remove the NON_SECURE bit
> > > -		 * from MI_BATCH_BUFFER_START commands issued in the
> > > -		 * dispatch_execbuffer implementations. We specifically don't
> > > -		 * want that set when the command parser is enabled.
> > > +		 * NB: By setting this flag, we're going to hit the
> > > +		 * normal secure dispatch path below. That path will
> > > +		 * do an extra pin of the shadow batch object and then
> > > +		 * unpin after ->gt.do_execbuf(). At that point, the
> > > +		 * object will have the single outstanding pin from
> > > +		 * just before calling i915_parse_cmds(), and will get
> > > +		 * unpinned via eb_destroy() and __EXEC_OBJECT_HAS_PIN.
> > >  		 */
> > 
> > That sounds like this additional pin should be put into the cmd parser,
> > just around the copy function.
> 
> I suppose we could do that. I had originally been trying to keep
> i915_parse_cmds() to only mapping/parsing the batch, but it's grown beyond
> that already I think.

Yeah, I think that split would make sense, but now with the batch copy it's
probably better to have it within i915_cmd_parser.c.

> > 
> > Also do_execbuf is already way too big a function. I think it would be
> > good to extract all the SECURE_DISPATCH/cmd parser/batch_bo special
> > handling code into a new function for clarity. And then if possible not
> > leak any of the cleanup code out.
> 
> The issue is all the potential failure points after this block. I'll look
> at it again and see if there's a way to keep it clean.

Yeah. Although I have to admit that the other motivation for this request
was to make sure you'll have to think about not leaking such cleanups for
error paths, like the unpin below ;-)

> > 
> > > 
> > > +			flags |= I915_DISPATCH_SECURE;
> > >  		}
> > >  	}
> > > 
> > > @@ -1418,6 +1461,14 @@ err:
> > >  	i915_gem_context_unreference(ctx);
> > >  	eb_destroy(eb);
> > > 
> > > +	if (ret && ring && shadow_batch_obj) {
> > > +		if (i915_gem_obj_is_pinned(shadow_batch_obj))
> > > +			i915_gem_object_ggtt_unpin(shadow_batch_obj);
> > 
> > Which would also clean up this code, which looks more like a hack to work
> > around the pin_count check on final unref than anything solid and
> > intentional ;-)
> 
> This is error path cleanup, at least for a window where we've allocated the
> shadow batch, pinned it, but haven't hooked it into the vma list. Again,
> I'll see if there's a cleaner way to do it, but I don't think it's that bad.

I think if you put the pinning into the cmd parser function it should just
vanish.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx