> -----Original Message----- > From: Intel-gfx [mailto:intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf > Of Arun Siluvery > Sent: Wednesday, January 13, 2016 5:09 PM > To: Gordon, David S <david.s.gordon@xxxxxxxxx>; intel- > gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/i915: Expose exec parameter to force > non IA-Coherent for Gen9+ > > On 13/01/2016 14:11, Dave Gordon wrote: > > On 13/01/16 12:53, Chris Wilson wrote: > >> On Wed, Jan 13, 2016 at 11:44:39AM +0100, Artur Harasimiuk wrote: > >>> Starting from Gen9 we can use IA-Coherent caches. Coherence may be > >>> not required in certain cases and can be disabled in an easy way. To > >>> do this we can set HDC_FORCE_NON_COHERENT bit in HDC_CHICKEN0 > >>> register. This register is part of HW context, however it is private > >>> and cannot be programmed from non-privileged batch buffer. > >>> > Looks like what you need is to add this register to HW whitelist so that > UMD can program required behaviour between batches but you need strong > justification and approval to whitelist a register. Is there a real > usecase or this is only for debug? > Whitelisting was already refused. > regards > Arun > > >>> New parameter is to override default programming and allow UMD to > >>> decide whether IA-Coherency is not needed for submitted batch buffer. > >>> When flag is set KMD emits commands to disable coherency before > >>> batch buffer execution starts. After execution finished state is > restored. > >> > >> What's the usecase for batch vs context flag? > In overall coherency can be programmed using RENDER_SURFACE_STATE (for statefull memory accesses) or using BTI.255 for stateless (see PRM: Data Cache Coherency). When HDC_FORCE_NON_COHERENT bit is clear, then all accesses are processed as described. By setting HDC_FORCE_NON_COHERENT to one we *Force* data port to ignore these attributes and caches are GPU-Coherent. However with stateless access mode IA-Coherency control is complicated. It would require to have ISA for each case, depending on required coherency for different surfaces. This requires ISA 'send' instruction with BTI.253 and BTI.255 mixed. Moreover, IA-Coherency requires additional effort from HW. Increased LLC usage or extra snooping for non-LLC. This costs additional power and causes performance penalties. So there is no reason to be always IA-Coherent. Per batch flags allows user space to select if single batch requires caches to be non-IA-Coherent. In effect, in single HW context we can work in both modes and have performance gain. Setting flag for context would mean that user space want to be non-IA-Coherent. For single batch which may require IA-Coherency, we need switch context back to IA-Coherent mode. User space can have two contexts, for GPU-Coherent and IA-Coherent, and use one of them during batch buffer submission. But this means context switching which is overkill in this case, I think. I did per-batch flag, because it seems to be simpler and minimizes risk of making bug. This allow to keep state of workaround. > >> > >>> When WaForceEnableNonCoherent is programmed, it does not make sense > >>> to allow for coherency because this can lead to GPU hangs. In such > >>> situation flag is ignored. > >>> > >>> Signed-off-by: Artur Harasimiuk <artur.harasimiuk@xxxxxxxxx> > >>> --- > >>> drivers/gpu/drm/i915/i915_dma.c | 3 +++ > >>> drivers/gpu/drm/i915/i915_drv.h | 4 ++++ > >>> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 8 ++++++++ > >>> drivers/gpu/drm/i915/intel_lrc.c | 28 > >>> ++++++++++++++++++++++++++++ > >>> drivers/gpu/drm/i915/intel_ringbuffer.c | 7 +++++++ > >>> include/uapi/drm/i915_drm.h | 8 +++++++- > >>> 6 files changed, 57 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/gpu/drm/i915/i915_dma.c > >>> b/drivers/gpu/drm/i915/i915_dma.c index 44a896c..79ecf20 100644 > >>> --- a/drivers/gpu/drm/i915/i915_dma.c > >>> +++ b/drivers/gpu/drm/i915/i915_dma.c > >>> @@ -172,6 +172,9 @@ static int i915_getparam(struct drm_device *dev, > >>> void *data, > >>> case I915_PARAM_HAS_EXEC_SOFTPIN: > >>> value = 1; > >>> break; > >>> + case I915_PARAM_HAS_EXEC_FORCE_NON_COHERENT: > >>> + value = INTEL_INFO(dev)->gen >= 9; > >> > >> This should actually report the w/a status, or else you need to > >> provide some other means for userspace to detect if it can > >> successfully force non-coherency. > > It sounds like, we need to have way to pass w/a status from kernel to user Space? This parameter is to tell user space if it can request to disable IA-Coherency (Force). It tells user space about kernel capability. > > Agreed; especially with the naming used, something called > > *Force*Something shouldn't be reported as available but then silently > > ignored on use because some workaround is in effect. If user code > > knows that it wants to *force* the use of a particular mode then it > > has to know if it doesn't take effect. > > > > OTOH is it really "Force", or is it "Allow"? Or "Request", or "Prefer"? > > This means exactly "Force". It is equal to setting HDC_FORCE_NON_COHERENT bit. > >>> + break; > >>> default: > >>> DRM_DEBUG("Unknown parameter %d\n", param->param); > >>> return -EINVAL; > >>> diff --git a/drivers/gpu/drm/i915/i915_drv.h > >>> b/drivers/gpu/drm/i915/i915_drv.h index 104bd18..71d739c 100644 > >>> --- a/drivers/gpu/drm/i915/i915_drv.h > >>> +++ b/drivers/gpu/drm/i915/i915_drv.h > >>> @@ -886,6 +886,10 @@ struct intel_context { > >>> } engine[I915_NUM_RINGS]; > >>> > >>> struct list_head link; > >>> + > >>> + struct { > >>> + unsigned int WaForceEnableNonCoherent:1; > >>> + } wa; > >>> }; > >>> > >>> enum fb_op_origin { > >>> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c > >>> b/drivers/gpu/drm/i915/i915_gem_execbuffer.c > >>> index d469c47..2997a58 100644 > >>> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c > >>> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c > >>> @@ -1400,6 +1400,10 @@ i915_gem_do_execbuffer(struct drm_device > >>> *dev, void *data, > >>> if (!i915_gem_check_execbuffer(args)) > >>> return -EINVAL; > >>> > >>> + if ((args->flags & I915_EXEC_FORCE_NON_COHERENT) && > >>> + INTEL_INFO(dev)->gen < 9) > >>> + return -EINVAL; > >> > >> It would seem easier to simply ignore this request in all situations > >> where it doesn't apply, rather than the current select few. I'm not sure. Request to disable coherency for non gen9+ would mean: 1. bad user mode programming (did not checked I915_PARAM_HAS_EXEC_FORCE_NON_COHERENT). In this case we can skip flag. 2. out of sync with numbers - user mode thinks this is different flag. 3. ... > >> > >>> ret = validate_exec_list(dev, exec, args->buffer_count); > >>> if (ret) > >>> return ret; > >>> @@ -1494,6 +1498,10 @@ i915_gem_do_execbuffer(struct drm_device > >>> *dev, void *data, > >>> > >>> i915_gem_context_reference(ctx); > >>> > >>> + /* Clear this flag when WA is programmed */ > >>> + if (ctx->wa.WaForceEnableNonCoherent) > >>> + args->flags &= ~I915_EXEC_FORCE_NON_COHERENT; > >>> + > > > > This seems back-to-front (I'm only looking at the names, without > > checking what this bit really does), but surely if the > > "ForceEnableNonCoherent" workaround is in effect it should force the > > "FORCE_NON_COHERENT" flag ON, not OFF? > > When WaForceEnableNonCoherent is in action, this means HDC_FORCE_NON_COHERENT bit is set to one. This workaround is for possible GPU hangs. So, for devices requiring this workaround, we should keep this always set - do not allow to be IA-Coherent. By clearing this flag we keep code consistent. > > Also, the use of a per-batch bool doesn't seem to leave any don't-care > > option. Is there any use case where the user code absolutely *doesn't* > > want this non-coherency? Or which of these are useful: > > > > 1. Require non-coherent, fail if not possible > > 2. Request non-coherent, warn me but continue if not possible > > 3. Prefer non-coherent, ignore failure (don't even tell me) > > 4. Don't care, use system default > > 5. Prefer coherent, ignore failure (don't tell me) > > 6. Request coherent, warn me (but continue) if not possible > > 7. Require coherent, fail if not possible > > We don't need don't-care scenario, IMO. I think flags meaning is quite simple: 1. When flag is not set, this means use system default. No behavior change. 2. When set - disable IA-Coherency when enabled (wa not programmed) - do nothing, because coherency is already disabled (wa programmed) Looks like this should be in commit message. > > Then we just need to make sure the names mean what they would appear to. I agree. I will update commit message to better describe things depending on further input. > > > > .Dave. > > > >>> if (ctx->ppgtt) > >>> vm = &ctx->ppgtt->base; > >>> else > >>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c > >>> b/drivers/gpu/drm/i915/intel_lrc.c > >>> index ab344e0..4482a6a 100644 > >>> --- a/drivers/gpu/drm/i915/intel_lrc.c > >>> +++ b/drivers/gpu/drm/i915/intel_lrc.c > >>> @@ -879,6 +879,24 @@ int intel_logical_ring_reserve_space(struct > >>> drm_i915_gem_request *request) > >>> return intel_logical_ring_begin(request, 0); > >>> } > >>> > >>> +static inline void > >>> +intel_lr_emit_force_non_coherent(struct i915_execbuffer_params > *params, > >>> + struct drm_i915_gem_execbuffer2 *args, bool force) { > >>> + if (args->flags & I915_EXEC_FORCE_NON_COHERENT) { > >>> + struct intel_ringbuffer *ringbuf = > >>> + params->ctx->engine[params->ring->id].ringbuf; > >> > >> This should be using the request->ringbuf > >> > >>> + > >> > >> Missing ring_begin. > >> > >>> + intel_logical_ring_emit(ringbuf, MI_NOOP); > >>> + intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1)); > >>> + intel_logical_ring_emit(ringbuf, HDC_CHICKEN0.reg); > >> > >> intel_logical_ring_emit_reg(ringbuf, HDC_CHICKEN0); > >> > >>> + intel_logical_ring_emit(ringbuf, force ? > >>> + _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT) : > >>> + _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT)); > >>> + intel_logical_ring_advance(ringbuf); > >>> + } > >>> +} > >>> + > >>> /** > >>> * execlists_submission() - submit a batchbuffer for execution, > >>> Execlists style > >>> * @dev: DRM device. > >>> @@ -959,6 +977,8 @@ int intel_execlists_submission(struct > >>> i915_execbuffer_params *params, > >>> dev_priv->relative_constants_mode = instp_mode; > >>> } > >>> > >>> + intel_lr_emit_force_non_coherent(params, args, true); > >> > >> This is after cache invalidation. Do you need to send another cache > >> invalidate? > >> We do not need this. > >>> + > >>> exec_start = params->batch_obj_vm_offset + > >>> args->batch_start_offset; > >>> > >>> @@ -966,6 +986,8 @@ int intel_execlists_submission(struct > >>> i915_execbuffer_params *params, > >>> if (ret) > >>> return ret; > >>> > >>> + intel_lr_emit_force_non_coherent(params, args, false); > >> > >> You anticipate a context switching between modes between every batch? > >> This goes back to whether it is a context flag vs batch flag, and > >> whether we should cache the register value. > >> Because this is batch flag, we do not need any context switching. We don't need to cache register value because flag is cleared if workaround is programmed (@@ -1494,6 +1498,10 @@). This way we do not allow to clear HDC_FORCE_NON_COHERENT bit when workaround is required. > >>> + > >>> trace_i915_gem_ring_dispatch(params->request, > >>> params->dispatch_flags); > >>> > >>> i915_gem_execbuffer_move_to_active(vmas, params->request); @@ > >>> -1112,6 +1134,9 @@ static int > >>> intel_logical_ring_workarounds_emit(struct drm_i915_gem_request *req) > >>> struct drm_device *dev = ring->dev; > >>> struct drm_i915_private *dev_priv = dev->dev_private; > >>> struct i915_workarounds *w = &dev_priv->workarounds; > >>> + struct intel_context *ctx = req->ctx; > >>> + > >>> + ctx->wa.WaForceEnableNonCoherent = 0; > >>> > >>> if (w->count == 0) > >>> return 0; > >>> @@ -1129,6 +1154,9 @@ static int > >>> intel_logical_ring_workarounds_emit(struct drm_i915_gem_request *req) > >>> for (i = 0; i < w->count; i++) { > >>> intel_logical_ring_emit_reg(ringbuf, w->reg[i].addr); > >>> intel_logical_ring_emit(ringbuf, w->reg[i].value); > >>> + ctx->wa.WaForceEnableNonCoherent |= > >>> + (w->reg[i].addr.reg == HDC_CHICKEN0.reg) && > >>> + (w->reg[i].value & HDC_FORCE_NON_COHERENT); > >> > >> I don't like this second guessing of what w/a are being applied. We > >> could just as easily explicitly list the w/a on the engine (i.e. a > >> synopsis of the wa reg list, plus those setup globally) and reference > >> that instead of copying that flag to every context. > >> -Chris > >> > Right. Will re-work patch to keep information in one place. Thanks, Artur > > > > _______________________________________________ > > Intel-gfx mailing list > > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx > http://lists.freedesktop.org/mailman/listinfo/intel-gfx _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx