The patch adds a parameter to control the data port coherency
functionality on a per-context level. When the IOCTL is called, a
command to switch the data port coherency state is added to the ordered
list. All prior requests are executed with the old coherency settings,
and all exec requests after the IOCTL will use the new settings.
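
The ordering guarantee described above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions, not the
i915 implementation: struct context, ctx_queue() and ctx_run() are
hypothetical names, and the "hardware" state is a plain boolean that
starts out disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a per-context ordered request list. */
enum req_kind { REQ_EXEC, REQ_SET_COHERENCY };

struct request {
	enum req_kind kind;
	/* REQ_SET_COHERENCY: state to switch to.
	 * REQ_EXEC: filled in by ctx_run() with the state seen at execution. */
	bool coherent;
};

#define MAX_REQS 16

struct context {
	struct request reqs[MAX_REQS];
	size_t count;
};

/* Append a request to the tail of the ordered list. */
static int ctx_queue(struct context *ctx, enum req_kind kind, bool coherent)
{
	if (ctx->count == MAX_REQS)
		return -1;
	ctx->reqs[ctx->count].kind = kind;
	ctx->reqs[ctx->count].coherent = coherent;
	ctx->count++;
	return 0;
}

/* Replay the list in submission order and record which coherency state
 * each exec request observed. Returns the number of exec requests. */
static size_t ctx_run(struct context *ctx)
{
	bool hw_coherent = false; /* assumed default: coherency disabled */
	size_t execs = 0;

	for (size_t i = 0; i < ctx->count; i++) {
		struct request *rq = &ctx->reqs[i];

		if (rq->kind == REQ_SET_COHERENCY) {
			hw_coherent = rq->coherent;
		} else {
			rq->coherent = hw_coherent;
			execs++;
		}
	}
	return execs;
}
```

Queuing an exec, then a switch to coherent, then another exec leaves
the first exec on the old setting and the second on the new one, which
is the behaviour the paragraph above describes.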

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Coherency at that level is disabled by
default due to its performance cost. The OpenCL driver plans to enable
it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is
   shared between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that
enables the CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
architecture, adds overhead related to tracking (snooping) memory inside
different cache units (L1$, L2$, L3$, LLC$, etc.). At the same time,
only a minority of modern OCL applications actually use
CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require memory coherency between
CPU and GPU). The goal of the coherency enable/disable switch is to
remove the overhead of memory coherency when memory coherency is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), the
Intel GEN ISA (GEN Instruction Set Architecture) contains dedicated
"send" instructions. These send instructions provide several addressing
models. One of these addressing models (named "stateless") provides the
most flexible I/O using plain virtual addresses (as opposed to
buffer_handle+offset models). This "stateless" model is similar to the
regular memory load/store operations available on typical CPUs. Since
this model provides I/O using arbitrary virtual addresses, it enables
algorithmic designs that are based on pointer-to-pointer (e.g. buffer of
pointers) concepts. For instance, it allows creating tree-like data
structures such as:

                  ________________
                 |     NODE1      |
                 | uint64_t data  |
                 +----------------|
                 | NODE*  | NODE* |
                 +--------+-------+
                  /              \
 ________________/                \________________
|     NODE2      |                |     NODE3      |
| uint64_t data  |                | uint64_t data  |
+----------------|                +----------------|
| NODE*  | NODE* |                | NODE*  | NODE* |
+--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory
locations in different OCL allocations - e.g. NODE1 and NODE2 can reside
in one OCL allocation while NODE3 resides in a completely separate OCL
allocation. Additionally, such pointers can be shared with the CPU (i.e.
using SVM - the Shared Virtual Memory feature). Using pointers from
different allocations doesn't affect the stateless addressing model,
which even allows scattered reading from different allocations at the
same time (i.e. by utilizing the SIMD nature of send instructions).
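
As an illustration of the pointer-based layout above, here is a minimal
host-side C sketch. It is an assumption-laden stand-in: calloc() plays
the role of two separate OCL allocations, and struct node, tree_sum()
and build_and_sum() are hypothetical names. The point is that the
traversal only follows raw pointers and never learns which allocation a
node lives in.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the NODE layout from the diagram: payload plus two children. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Sum all payloads by chasing plain pointers, as a stateless access
 * model would. No allocation information is available or needed here. */
static uint64_t tree_sum(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/* NODE1 and NODE2 share one allocation; NODE3 sits in a separate one.
 * Returns 0 if either allocation fails. */
static uint64_t build_and_sum(void)
{
	struct node *alloc_a = calloc(2, sizeof(*alloc_a)); /* NODE1, NODE2 */
	struct node *alloc_b = calloc(1, sizeof(*alloc_b)); /* NODE3 */
	uint64_t sum = 0;

	if (alloc_a && alloc_b) {
		alloc_a[0].data = 1;
		alloc_a[0].left = &alloc_a[1];  /* same allocation */
		alloc_a[0].right = &alloc_b[0]; /* different allocation */
		alloc_a[1].data = 2;
		alloc_b[0].data = 3;
		sum = tree_sum(&alloc_a[0]);
	}

	free(alloc_a);
	free(alloc_b);
	return sum;
}
```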

When it comes to coherency programming, send instructions in the
stateless model can be encoded (at ISA level) to either use or disable
coherency. However, for generic OCL applications (such as the example
with the tree-like data structure), the OCL compiler is not able to
determine the origin of memory pointed to by an arbitrary pointer - i.e.
it is not able to track a given pointer back to a specific allocation.
As such, it is not able to decide whether coherency is needed or not for
a specific pointer (or for a specific I/O instruction). As a result, the
compiler encodes all stateless sends as coherent (doing otherwise would
lead to functional issues resulting from data corruption). Please note
that it would be possible to work around this (e.g. based on an
allocation map and pointer bounds checking prior to each I/O
instruction), but the performance cost of such a workaround would be
many times greater than the cost of keeping coherency always enabled.
As such, enabling/disabling memory coherency at GEN ISA level is not
feasible and an alternative method is needed.
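
To make the rejected workaround concrete, the following sketch shows
what a per-access allocation-map lookup could look like. It is purely
illustrative: struct alloc_range and needs_coherency() are hypothetical,
the map is a flat array searched linearly, and this runs on the host
rather than on an EU. Every single load or store would have to pay for
such a lookup before it could pick a coherent or non-coherent encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per allocation known to the runtime (hypothetical layout). */
struct alloc_range {
	uintptr_t start;
	size_t size;
	bool fine_grain; /* true for fine grain SVM allocations */
};

/* Bounds-check an address against every known allocation. Addresses of
 * unknown origin are treated as coherent, the only safe default. */
static bool needs_coherency(const struct alloc_range *map, size_t n,
			    uintptr_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= map[i].start &&
		    addr - map[i].start < map[i].size)
			return map[i].fine_grain;
	}
	return true;
}
```

Even in this simplest form the check costs a loop with two comparisons
per allocation for each memory access, which is the kind of overhead the
paragraph above argues against.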

Such an alternative solution is to have a global coherency switch that
allows disabling coherency for a single (though entire) GPU submission.
This is beneficial because this way we:
 * can enable (and pay for) coherency only in submissions that actually
   need coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER
   resources)
 * don't care about coherency at GEN ISA granularity (no performance
   impact)

3. Will the coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch. E.g. an application has two OCL compute kernels: kern_master and
kern_worker. kern_master uses, concurrently with the CPU, some fine
grain SVM resources (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources
contain descriptors of computational work that needs to be executed.
kern_master analyzes incoming work descriptors and populates a plain OCL
buffer (non-fine-grain) with payload for kern_worker. Once kern_master
is done, kern_worker kicks in and processes the payload that kern_master
produced. These two kernels work in a loop, one after another. Since
only kern_master requires coherency, kern_worker should not be forced to
pay for it. This means that we need to have the ability to toggle the
coherency switch on or off for each GPU submission:

(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker ->
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
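
The toggling pattern above can be modelled with a small sketch. It
assumes one possible reading of a lazily written flag: the requested
state is only recorded when userspace asks for it, and a switch is
emitted at submission time only if the requested state differs from the
one last programmed. struct ctx_state and the functions below are
hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct ctx_state {
	bool requested;        /* last state asked for by userspace */
	bool programmed;       /* state last emitted to the hardware */
	unsigned int switches; /* number of switch commands emitted */
};

/* Record the request only; nothing is emitted yet. */
static void ctx_request_coherency(struct ctx_state *s, bool enable)
{
	s->requested = enable;
}

/* Emit a switch only when the state actually changes, then "run" the
 * submission. Returns the coherency state the submission executes with. */
static bool ctx_submit(struct ctx_state *s)
{
	if (s->programmed != s->requested) {
		s->programmed = s->requested;
		s->switches++;
	}
	return s->programmed;
}

/* kern_master needs coherency, kern_worker does not. Returns the total
 * number of switch commands emitted so far. */
static unsigned int run_master_worker_loop(struct ctx_state *s,
					   unsigned int iterations)
{
	for (unsigned int i = 0; i < iterations; i++) {
		ctx_request_coherency(s, true);
		ctx_submit(s); /* kern_master */
		ctx_request_coherency(s, false);
		ctx_submit(s); /* kern_worker */
	}
	return s->switches;
}
```

In this model each loop iteration costs two switches, while repeated
requests for the already programmed state cost none.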

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved to coding standard.
    Introduced a macro for checking whether hardware supports the
    feature.
v5: Renamed some locals. Made the flag write to be lazy.
    Updated comments to remove misconceptions. Added gen11 support.

Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Michal Winiarski <michal.winiarski@xxxxxxxxx>
Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@xxxxxxxxx>
---
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++---
drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
drivers/gpu/drm/i915/i915_gem_execbuffer.c | 6 ++++
drivers/gpu/drm/i915/intel_lrc.c | 55 ++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 4 +++
include/uapi/drm/i915_drm.h | 7 ++++
7 files changed, 115 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 01dd298..73192e1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
#define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
#define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
#define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..b5b63ac 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_GTT_SIZE:
if (ctx->ppgtt)
args->value = ctx->ppgtt->vm.total;
- else if (to_i915(dev)->mm.aliasing_ppgtt)
- args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+ else if (i915->mm.aliasing_ppgtt)
+ args->value = i915->mm.aliasing_ppgtt->vm.total;
else
- args->value = to_i915(dev)->ggtt.vm.total;
+ args->value = i915->ggtt.vm.total;
break;
case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
args->value = i915_gem_context_no_error_capture(ctx);
@@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_PRIORITY:
args->value = ctx->sched.priority;
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else
+ args->value = i915_gem_context_is_data_port_coherent_requested(ctx);