To avoid it, guard scratch page should be added after error_capture.
The patch fixes the most annoying issue with error capture but
since WC reads are used also in other places there is a risk similar
problem can affect them as well.
Signed-off-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
---
This patch tries to diminish plague of DMAR read errors present
in CI for ADL*, RPL*, DG2 platforms, see for example [1] (grep DMAR).
CI is usually tolerant for these errors, so the scale of the problem
is not really visible.
To show it I have counted lines containing DMAR read errors in dmesgs
produced by CI for all three versions of the patch, but in contrast
to v2
I have grepped only for lines containing "PTE Read access".
Below stats for kernel w/o patch vs patched one.
v1: 210 vs 0
v2: 201 vs 0
v3: 214 vs 0
Apparently the patch fixes all common PTE read errors.
In previous version there were different numbers due to less exact
grepping,
"grep DMAR" catched write errors and "DMAR: DRHD: handling fault
status reg"
lines, anyway the actual number of errors is much bigger - DMAR errors
are rate-limited.
[1]:
http://gfx-ci.igk.intel.com/tree/drm-tip/CI_DRM_12678/bat-adln-1/dmesg0.txt
Changelog:
v2:
- modified commit message (I hope the diagnosis is correct),
- added bug checks to ensure scratch is initialized on gen3
platforms.
CI produces strange stacktrace for it suggesting scratch[0]
is NULL,
to be removed after resolving the issue with gen3 platforms.
v3:
- removed bug checks, replaced with gen check.
v4:
- change code for scratch page insertion to support all platforms,
- add info in commit message there could be more similar issues
Regards
Andrzej
---
drivers/gpu/drm/i915/gt/intel_ggtt.c | 31
++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 842e69c7b21e49..6566d2066f1f8b 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -503,6 +503,21 @@ static void cleanup_init_ggtt(struct i915_ggtt
*ggtt)
mutex_destroy(&ggtt->error_mutex);
}
+static void
+ggtt_insert_scratch_pages(struct i915_ggtt *ggtt, u64 offset, u64
length)
+{
+ struct i915_address_space *vm = &ggtt->vm;
+
+ if (GRAPHICS_VER(ggtt->vm.i915) < 8)
+ return vm->clear_range(vm, offset, length);
+ /* clear_range since gen8 is nop */
+ while (length > 0) {
+ vm->insert_page(vm, px_dma(vm->scratch[0]), offset,
I915_CACHE_NONE, 0);
+ offset += I915_GTT_PAGE_SIZE;
+ length -= I915_GTT_PAGE_SIZE;
+ }
+}
+
static int init_ggtt(struct i915_ggtt *ggtt)
{
/*
@@ -551,8 +566,12 @@ static int init_ggtt(struct i915_ggtt *ggtt)
* paths, and we trust that 0 will remain reserved. However,
* the only likely reason for failure to insert is a driver
* bug, which we expect to cause other failures...
+ *
+ * Since CPU can perform speculative reads on error capture
+ * (write-combining allows it) add scratch page after error
+ * capture to avoid DMAR errors.
*/
- ggtt->error_capture.size = I915_GTT_PAGE_SIZE;
+ ggtt->error_capture.size = 2 * I915_GTT_PAGE_SIZE;
ggtt->error_capture.color = I915_COLOR_UNEVICTABLE;
if (drm_mm_reserve_node(&ggtt->vm.mm, &ggtt->error_capture))
drm_mm_insert_node_in_range(&ggtt->vm.mm,
@@ -562,11 +581,15 @@ static int init_ggtt(struct i915_ggtt *ggtt)
0, ggtt->mappable_end,
DRM_MM_INSERT_LOW);
}
- if (drm_mm_node_allocated(&ggtt->error_capture))
+ if (drm_mm_node_allocated(&ggtt->error_capture)) {
+ u64 start = ggtt->error_capture.start;
+ u64 size = ggtt->error_capture.size;
+
+ ggtt_insert_scratch_pages(ggtt, start, size);
drm_dbg(&ggtt->vm.i915->drm,
"Reserved GGTT:[%llx, %llx] for use by error capture\n",
- ggtt->error_capture.start,
- ggtt->error_capture.start + ggtt->error_capture.size);
+ start, start + size);
+ }
/*
* The upper portion of the GuC address space has a sizeable hole