Quoting Tomasz Lis (2018-10-12 18:02:56)
> The patch adds a parameter to control the data port coherency
> functionality on a per-context level. When the IOCTL is called, a
> command to switch data port coherency state is added to the ordered
> list. All prior requests are executed on the old coherency settings,
> and all exec requests after the IOCTL will use the new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at the data port level. Keeping the coherency at that level
> is disabled by default due to its performance costs. The OpenCL driver
> plans to enable it for a small subset of submissions, when such
> functionality is required. Below are answers to basic questions
> explaining the background of the functionality and the reasoning for
> the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is
> shared between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that
> enables the CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
> architecture, adds overhead related to tracking (snooping) memory
> inside different cache units (L1$, L2$, L3$, LLC$, etc.). At the same
> time, only a minority of modern OCL applications actually use
> CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require memory coherency
> between CPU and GPU). The goal of the coherency enable/disable switch
> is to remove the overhead of memory coherency when it is not needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units),
> Intel GEN ISA (GEN Instruction Set Assembly) contains dedicated "send"
> instructions. These send instructions provide several addressing
> models. One of these addressing models (named "stateless") provides
> the most flexible I/O, using plain virtual addresses (as opposed to
> buffer_handle+offset models). This "stateless" model is similar to
> regular memory load/store operations available on typical CPUs. Since
> this model provides I/O using arbitrary virtual addresses, it enables
> algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>
>                     ________________
>                    |      NODE1     |
>                    |  uint64_t data |
>                    +----------------+
>                    |  NODE*  | NODE*|
>                    +---------+------+
>                      /           \
>     ________________/             \________________
>    |      NODE2     |             |      NODE3     |
>    |  uint64_t data |             |  uint64_t data |
>    +----------------+             +----------------+
>    |  NODE*  | NODE*|             |  NODE*  | NODE*|
>    +---------+------+             +---------+------+
>
> Please note that pointers inside such structures can point to memory
> locations in different OCL allocations - e.g. NODE1 and NODE2 can
> reside in one OCL allocation while NODE3 resides in a completely
> separate OCL allocation. Additionally, such pointers can be shared
> with the CPU (i.e. using SVM - the Shared Virtual Memory feature).
> Using pointers from different allocations doesn't affect the stateless
> addressing model, which even allows scattered reading from different
> allocations at the same time (i.e. by utilizing the SIMD nature of
> send instructions).
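As a side note for readers following along: the diagram above boils down
to a self-referential C struct. A minimal sketch of such a node - purely
illustrative, not part of the patch - could be:

    #include <stdint.h>

    /*
     * One tree node as in the diagram: a payload plus two child
     * pointers. With fine-grain SVM the children may live in entirely
     * different OCL allocations, which is why the compiler cannot trace
     * an arbitrary "struct node *" back to its allocation when encoding
     * a stateless send.
     */
    struct node {
            uint64_t data;
            struct node *left;
            struct node *right;
    };
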
> When it comes to coherency programming, send instructions in the
> stateless model can be encoded (at ISA level) to either use or disable
> coherency. However, for generic OCL applications (such as the example
> with the tree-like data structure), the OCL compiler is not able to
> determine the origin of memory pointed to by an arbitrary pointer -
> i.e. it is not able to track a given pointer back to a specific
> allocation. As such, it is not able to decide whether coherency is
> needed or not for a specific pointer (or for a specific I/O
> instruction). As a result, the compiler encodes all stateless sends as
> coherent (doing otherwise would lead to functional issues resulting
> from data corruption). Please note that it would be possible to work
> around this (e.g. based on an allocations map and pointer bounds
> checking prior to each I/O instruction), but the performance cost of
> such a workaround would be many times greater than the cost of keeping
> coherency always enabled. As such, enabling/disabling memory coherency
> at GEN ISA level is not feasible and an alternative method is needed.
>
> Such an alternative solution is to have a global coherency switch that
> allows disabling coherency for a single (though entire) GPU
> submission. This is beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually
>   need coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER
>   resources)
> * don't care about coherency at GEN ISA granularity (no performance
>   impact)

Might be worth mentioning that this address space compatibility can be
achieved with userptr + soft-pinning allocations to their process space
addresses. Anyway, this is:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>

But as mentioned previously, getting this merged needs the test to be
finished and clarity from the userspace project side.

Regards, Joonas

> 3. Will the coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the
> coherency switch. E.g. an application has two OCL compute kernels:
> kern_master and kern_worker. kern_master uses, concurrently with the
> CPU, some fine grain SVM resources (CL_MEM_SVM_FINE_GRAIN_BUFFER).
> These resources contain descriptors of computational work that needs
> to be executed. kern_master analyzes incoming work descriptors and
> populates a plain OCL buffer (non-fine-grain) with payload for
> kern_worker. Once kern_master is done, kern_worker kicks in and
> processes the payload that kern_master produced. These two kernels
> work in a loop, one after another. Since only kern_master requires
> coherency, kern_worker should not be forced to pay for it. This means
> that we need the ability to toggle the coherency switch on or off for
> each GPU submission:
>
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker ->
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
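To make the intended toggling concrete for the userspace folks: with the
uapi added by this patch, the switch would be flipped through
CONTEXT_SETPARAM between the two kernels. A rough sketch (untested,
error handling trimmed; assumes the updated i915_drm.h):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Returns 0 on success; fails with errno == ENODEV on gen < 9. */
    static int set_data_port_coherency(int fd, uint32_t ctx_id, uint64_t on)
    {
            struct drm_i915_gem_context_param p = {
                    .ctx_id = ctx_id,
                    .param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY,
                    .value = on, /* strict: only 0 or 1 are accepted */
            };

            return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
    }

The loop from point 3 then becomes: set_data_port_coherency(fd, ctx, 1),
submit kern_master, set_data_port_coherency(fd, ctx, 0), submit
kern_worker, and so on. Since the flag write is lazy, redundant toggles
cost nothing until the requested state actually changes.
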
> v2: Fixed compilation warning.
> v3: Refactored the patch to add an IOCTL instead of an exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>     Removed redundant GEM_WARN_ON()s. Improved adherence to coding
>     standard. Introduced a macro for checking whether hardware supports
>     the feature.
> v5: Renamed some locals. Made the flag write lazy.
>     Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some
>     functions. Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
>
> Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Michal Winiarski <michal.winiarski@xxxxxxxxx>
>
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
>  drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
>  include/uapi/drm/i915_drm.h             | 10 ++++++
>  5 files changed, 116 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3017ef0..90b3a0ff 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>  #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>  #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>  				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>
>  #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  				    struct drm_file *file)
>  {
> +	struct drm_i915_private *i915 = to_i915(dev);
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct drm_i915_gem_context_param *args = data;
>  	struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_GTT_SIZE:
>  		if (ctx->ppgtt)
>  			args->value = ctx->ppgtt->vm.total;
> -		else if (to_i915(dev)->mm.aliasing_ppgtt)
> -			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +		else if (i915->mm.aliasing_ppgtt)
> +			args->value = i915->mm.aliasing_ppgtt->vm.total;
>  		else
> -			args->value = to_i915(dev)->ggtt.vm.total;
> +			args->value = i915->ggtt.vm.total;
>  		break;
>  	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>  		args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_PRIORITY:
>  		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>  		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  				    struct drm_file *file)
>  {
> +	struct drm_i915_private *i915 = to_i915(dev);
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct drm_i915_gem_context_param *args = data;
>  	struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>
>  		if (args->size)
>  			ret = -EINVAL;
> -		else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +		else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>  			ret = -ENODEV;
>  		else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>  			 priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  		}
>  		break;
>
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>  	default:
>  		ret = -EINVAL;
>  		break;
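The get side above doubles as a capability probe for userspace;
something along these lines (again just an illustrative sketch, not
from the patch):

    /* Query support and the current setting; assumes the same uapi. */
    static int get_data_port_coherency(int fd, uint32_t ctx_id, uint64_t *out)
    {
            struct drm_i915_gem_context_param p = {
                    .ctx_id = ctx_id,
                    .param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY,
            };

            if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &p))
                    return -1; /* ENODEV means no HW support (gen < 9) */

            *out = p.value;
            return 0;
    }

Note this reports the REQUESTED state, i.e. the last value set, not
whether the switching LRI has already executed on the GPU.
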
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..69f9247 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
>  #define CONTEXT_BANNED			0
>  #define CONTEXT_CLOSED			1
>  #define CONTEXT_FORCE_SINGLE_SUBMISSION	2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	3
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	4
>
>  	/**
>  	 * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
>  	atomic_dec(&ctx->hw_id_pin_count);
>  }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>  static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>  {
>  	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..8680bc2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>  	ce->lrc_desc = desc;
>  }
>
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +	cs = intel_ring_begin(rq, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(rq->i915) >= 11)
> +		reg = ICL_HDC_MODE;
> +	else if (INTEL_GEN(rq->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +	int ret;
> +
> +	lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +		return 0;
> +
> +	ret = emit_set_data_port_coherency(rq, enable);
> +
> +	if (!ret) {
> +		if (enable)
> +			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +		else
> +			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +	}
> +
> +	return ret;
> +}
> +
>  static void unwind_wa_tail(struct i915_request *rq)
>  {
>  	rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
>  		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
>  	bool vf_flush_wa = false, dc_flush_wa = false;
>  	u32 *cs, flags = 0;
> -	int len;
> +	int err, len;
>
>  	flags |= PIPE_CONTROL_CS_STALL;
>
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
>  		/* WaForGAMHang:kbl */
>  		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>  			dc_flush_wa = true;
> +
> +		err = intel_lr_context_update_data_port_coherency(request);
> +		if (WARN_ON(err)) {
> +			DRM_DEBUG("Data Port Coherency toggle failed.\n");
> +			return err;
> +		}
>  	}
>
>  	len = 6;
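One detail worth spelling out for readers of
emit_set_data_port_coherency(): HDC_CHICKEN0 and its gen10/gen11
counterparts are masked registers, where the upper 16 bits select which
of the lower 16 bits a write actually touches. That is what lets a
single MI_LOAD_REGISTER_IMM flip one bit without a read-modify-write.
The i915 helpers reduce to (simplified for illustration):

    /* Upper half = write-enable mask, lower half = new bit values. */
    #define _MASKED_BIT_ENABLE(a)   (((a) << 16) | (a))   /* set bit(s) */
    #define _MASKED_BIT_DISABLE(a)  ((a) << 16)           /* clear bit(s) */

So enabling coherency clears HDC_FORCE_NON_COHERENT and disabling it
sets the bit, matching the inverted sense in the function above.
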
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..7c9e153 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
>  #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>  	__u64 value;
> };
>
> --
> 2.7.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx