Quoting Tomasz Lis (2018-10-12 18:02:56)
> The patch adds a parameter to control the data port coherency
> functionality on a per-context level. When the IOCTL is called, a
> command to switch data port coherency state is added to the ordered
> list. All prior requests are executed on the old coherency settings,
> and all exec requests after the IOCTL will use the new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at the data port level. Keeping the coherency at that level
> is disabled by default due to its performance costs. The OpenCL driver
> plans to enable it for a small subset of submissions, when such
> functionality is required. Below are answers to basic questions
> explaining the background of the functionality and the reasoning for
> the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is
> shared between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that
> enables the CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
> architecture, adds overhead related to tracking (snooping) memory
> inside different cache units (L1$, L2$, L3$, LLC$, etc.). At the same
> time, only a minority of modern OCL applications actually use
> CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require memory coherency
> between CPU and GPU). The goal of the coherency enable/disable switch
> is to remove the overhead of memory coherency when it is not needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units),
> Intel GEN ISA (GEN Instruction Set Assembly) contains dedicated "send"
> instructions. These send instructions provide several addressing
> models. One of these addressing models (named "stateless") provides
> the most flexible I/O, using plain virtual addresses (as opposed to
> buffer_handle+offset models). This "stateless" model is similar to
> regular memory load/store operations available on typical CPUs. Since
> this model provides I/O using arbitrary virtual addresses, it enables
> algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>
>                     ________________
>                    |      NODE1     |
>                    |  uint64_t data |
>                    +----------------+
>                    |  NODE*  | NODE*|
>                    +---------+------+
>                      /           \
>     ________________/             \________________
>    |      NODE2     |             |      NODE3     |
>    |  uint64_t data |             |  uint64_t data |
>    +----------------+             +----------------+
>    |  NODE*  | NODE*|             |  NODE*  | NODE*|
>    +---------+------+             +---------+------+
>
> Please note that pointers inside such structures can point to memory
> locations in different OCL allocations - e.g. NODE1 and NODE2 can
> reside in one OCL allocation while NODE3 resides in a completely
> separate OCL allocation. Additionally, such pointers can be shared
> with the CPU (i.e. using SVM - the Shared Virtual Memory feature).
> Using pointers from different allocations doesn't affect the stateless
> addressing model, which even allows scattered reading from different
> allocations at the same time (i.e. by utilizing the SIMD nature of
> send instructions).
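As a side note for readers following along: the diagram above boils down
to a self-referential C struct. A minimal sketch of such a node - purely
illustrative, not part of the patch - could be:

    #include <stdint.h>

    /*
     * One tree node as in the diagram: a payload plus two child
     * pointers. With fine-grain SVM the children may live in entirely
     * different OCL allocations, which is why the compiler cannot trace
     * an arbitrary "struct node *" back to its allocation when encoding
     * a stateless send.
     */
    struct node {
            uint64_t data;
            struct node *left;
            struct node *right;
    };
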
> When it comes to coherency programming, send instructions in the
> stateless model can be encoded (at ISA level) to either use or disable
> coherency. However, for generic OCL applications (such as the example
> with the tree-like data structure), the OCL compiler is not able to
> determine the origin of memory pointed to by an arbitrary pointer -
> i.e. it is not able to track a given pointer back to a specific
> allocation. As such, it is not able to decide whether coherency is
> needed or not for a specific pointer (or for a specific I/O
> instruction). As a result, the compiler encodes all stateless sends as
> coherent (doing otherwise would lead to functional issues resulting
> from data corruption). Please note that it would be possible to work
> around this (e.g. based on an allocations map and pointer bounds
> checking prior to each I/O instruction), but the performance cost of
> such a workaround would be many times greater than the cost of keeping
> coherency always enabled. As such, enabling/disabling memory coherency
> at GEN ISA level is not feasible and an alternative method is needed.
>
> Such an alternative solution is to have a global coherency switch that
> allows disabling coherency for a single (though entire) GPU
> submission. This is beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually
>   need coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER
>   resources)
> * don't care about coherency at GEN ISA granularity (no performance
>   impact)

Might be worth mentioning that this address space compatibility can be
achieved with userptr + soft-pinning allocations to their process space
addresses. Anyway, this is:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>

But as mentioned previously, getting this merged needs the test to be
finished and clarity from the userspace project side.

Regards, Joonas

> 3. Will the coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the
> coherency switch. E.g. an application has two OCL compute kernels:
> kern_master and kern_worker. kern_master uses, concurrently with the
> CPU, some fine grain SVM resources (CL_MEM_SVM_FINE_GRAIN_BUFFER).
> These resources contain descriptors of computational work that needs
> to be executed. kern_master analyzes incoming work descriptors and
> populates a plain OCL buffer (non-fine-grain) with payload for
> kern_worker. Once kern_master is done, kern_worker kicks in and
> processes the payload that kern_master produced. These two kernels
> work in a loop, one after another. Since only kern_master requires
> coherency, kern_worker should not be forced to pay for it. This means
> that we need the ability to toggle the coherency switch on or off for
> each GPU submission:
>
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker ->
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
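To make the intended toggling concrete for the userspace folks: with the
uapi added by this patch, the switch would be flipped through
CONTEXT_SETPARAM between the two kernels. A rough sketch (untested,
error handling trimmed; assumes the updated i915_drm.h):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Returns 0 on success; fails with errno == ENODEV on gen < 9. */
    static int set_data_port_coherency(int fd, uint32_t ctx_id, uint64_t on)
    {
            struct drm_i915_gem_context_param p = {
                    .ctx_id = ctx_id,
                    .param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY,
                    .value = on, /* strict: only 0 or 1 are accepted */
            };

            return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
    }

The loop from point 3 then becomes: set_data_port_coherency(fd, ctx, 1),
submit kern_master, set_data_port_coherency(fd, ctx, 0), submit
kern_worker, and so on. Since the flag write is lazy, redundant toggles
cost nothing until the requested state actually changes.
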
> v2: Fixed compilation warning.
> v3: Refactored the patch to add an IOCTL instead of an exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>     Removed redundant GEM_WARN_ON()s. Improved adherence to coding
>     standard. Introduced a macro for checking whether hardware supports
>     the feature.
> v5: Renamed some locals. Made the flag write lazy.
>     Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some
>     functions. Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
>
> Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Michal Winiarski <michal.winiarski@xxxxxxxxx>
>
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
>  drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
>  include/uapi/drm/i915_drm.h             | 10 ++++++
>  5 files changed, 116 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3017ef0..90b3a0ff 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>  #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>  #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>  				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>
>  #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  				    struct drm_file *file)
>  {
> +	struct drm_i915_private *i915 = to_i915(dev);
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct drm_i915_gem_context_param *args = data;
>  	struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_GTT_SIZE:
>  		if (ctx->ppgtt)
>  			args->value = ctx->ppgtt->vm.total;
> -		else if (to_i915(dev)->mm.aliasing_ppgtt)
> -			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +		else if (i915->mm.aliasing_ppgtt)
> +			args->value = i915->mm.aliasing_ppgtt->vm.total;
>  		else
> -			args->value = to_i915(dev)->ggtt.vm.total;
> +			args->value = i915->ggtt.vm.total;
>  		break;
>  	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>  		args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_PRIORITY:
>  		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>  		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  				    struct drm_file *file)
>  {
> +	struct drm_i915_private *i915 = to_i915(dev);
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct drm_i915_gem_context_param *args = data;
>  	struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>
>  		if (args->size)
>  			ret = -EINVAL;
> -		else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +		else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>  			ret = -ENODEV;
>  		else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>  			 priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  		}
>  		break;
>
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>  	default:
>  		ret = -EINVAL;
>  		break;
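The get side above doubles as a capability probe for userspace;
something along these lines (again just an illustrative sketch, not
from the patch):

    /* Query support and the current setting; assumes the same uapi. */
    static int get_data_port_coherency(int fd, uint32_t ctx_id, uint64_t *out)
    {
            struct drm_i915_gem_context_param p = {
                    .ctx_id = ctx_id,
                    .param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY,
            };

            if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &p))
                    return -1; /* ENODEV means no HW support (gen < 9) */

            *out = p.value;
            return 0;
    }

Note this reports the REQUESTED state, i.e. the last value set, not
whether the switching LRI has already executed on the GPU.
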
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..69f9247 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
>  #define CONTEXT_BANNED			0
>  #define CONTEXT_CLOSED			1
>  #define CONTEXT_FORCE_SINGLE_SUBMISSION	2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	3
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	4
>
>  	/**
>  	 * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
>  	atomic_dec(&ctx->hw_id_pin_count);
>  }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>  static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>  {
>  	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..8680bc2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>  	ce->lrc_desc = desc;
>  }
>
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +	cs = intel_ring_begin(rq, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(rq->i915) >= 11)
> +		reg = ICL_HDC_MODE;
> +	else if (INTEL_GEN(rq->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +	int ret;
> +
> +	lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +		return 0;
> +
> +	ret = emit_set_data_port_coherency(rq, enable);
> +
> +	if (!ret) {
> +		if (enable)
> +			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +		else
> +			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +	}
> +
> +	return ret;
> +}
> +
>  static void unwind_wa_tail(struct i915_request *rq)
>  {
>  	rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
>  		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
>  	bool vf_flush_wa = false, dc_flush_wa = false;
>  	u32 *cs, flags = 0;
> -	int len;
> +	int err, len;
>
>  	flags |= PIPE_CONTROL_CS_STALL;
>
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
>  		/* WaForGAMHang:kbl */
>  		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>  			dc_flush_wa = true;
> +
> +		err = intel_lr_context_update_data_port_coherency(request);
> +		if (WARN_ON(err)) {
> +			DRM_DEBUG("Data Port Coherency toggle failed.\n");
> +			return err;
> +		}
>  	}
>
>  	len = 6;
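One detail worth spelling out for readers of
emit_set_data_port_coherency(): HDC_CHICKEN0 and its gen10/gen11
counterparts are masked registers, where the upper 16 bits select which
of the lower 16 bits a write actually touches. That is what lets a
single MI_LOAD_REGISTER_IMM flip one bit without a read-modify-write.
The i915 helpers reduce to (simplified for illustration):

    /* Upper half = write-enable mask, lower half = new bit values. */
    #define _MASKED_BIT_ENABLE(a)   (((a) << 16) | (a))   /* set bit(s) */
    #define _MASKED_BIT_DISABLE(a)  ((a) << 16)           /* clear bit(s) */

So enabling coherency clears HDC_FORCE_NON_COHERENT and disabling it
sets the bit, matching the inverted sense in the function above.
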
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..7c9e153 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
>  #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>  	__u64 value;
> };
>
> --
> 2.7.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx