On 12/04/2023 12:33, Andi Shyti wrote:
Currently, when we perform operations such as clearing or copying
large blocks of memory, we generate multiple requests that are
executed in a chain.
However, if one of these requests fails, we may not realize it
unless it happens to be the last request in the chain. This is
because errors are not properly propagated.
For this we need to keep propagating the chain of fence
notification in order to always reach the final fence associated
to the final request.
To address this issue, we need to ensure that the chain of fence
notifications is always propagated so that we can reach the final
fence associated with the last request. By doing so, we will be
able to detect any memory operation failures and determine
whether the memory is still invalid.
Above two paragraphs seems to have redundancy in the message they convey.
On copy and clear migration signal fences upon completion.
On copy and clear migration, signal fences upon request
completion to ensure that we have a reliable perpetuation of the
operation outcome.
These two too. So I think commit message can be a bit polished.
Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
Reported-by: Matthew Auld <matthew.auld@xxxxxxxxx>
Suggested-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Reviewed-by: Matthew Auld <matthew.auld@xxxxxxxxx>
Acked-by: Nirmoy Das <nirmoy.das@xxxxxxxxx>
---
drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++++------
1 file changed, 39 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..668c95af8cbcf 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
dst_offset = 2 * CHUNK_SZ;
}
+ /*
+ * While building the chain of requests, we need to ensure
+ * that no one can sneak into the timeline unnoticed.
+ */
+ mutex_lock(&ce->timeline->mutex);
+
do {
int len;
- rq = i915_request_create(ce);
+ rq = i915_request_create_locked(ce);
if (IS_ERR(rq)) {
err = PTR_ERR(rq);
- goto out_ce;
+ break;
}
if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
/* Arbitration is re-enabled between requests. */
out_rq:
- if (*out)
+ i915_sw_fence_await(&rq->submit);
+ i915_request_get(rq);
+ i915_request_add_locked(rq);
+ if (*out) {
+ i915_sw_fence_complete(&(*out)->submit);
i915_request_put(*out);
Could you help me understand this please. I have a few questions -
first, what are the actual mechanics of fence error transfer here? I see
the submit fence is being blocked until the next request is submitted -
effectively previous request is only allowed to get on the hardware
after the next one has been queued up. But I don't immediately see what
that does in practice.
Second question relates to the need to hold the timeline mutex
throughout. Presumably this is so two copy or migrate operations on the
same context do not interleave, which can otherwise happen?
Would the error propagation be doable without the lock held by chaining
on the previous request _completion_ fence? If so I am sure that would
have a performance impact, because chunk by chunk would need a GPU<->CPU
round trip to schedule. How much of an impact I don't know. Maybe
enlarging CHUNK_SZ to compensate is an option?
Or if the perf hit would be bearable for stable backports only (much
smaller patch) and then for tip we can do this full speed solution.
But yes, I would first want to understand the actual error propagation
mechanism because sadly my working knowledge is a bit rusty.
- *out = i915_request_get(rq);
- i915_request_add(rq);
+ }
+ *out = rq;
if (err)
break;
@@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
cond_resched();
} while (1);
-out_ce:
+ mutex_unlock(&ce->timeline->mutex);
+
+ if (*out)
+ i915_sw_fence_complete(&(*out)->submit);
return err;
}
@@ -999,13 +1012,19 @@ intel_context_migrate_clear(struct intel_context *ce,
if (HAS_64K_PAGES(i915) && is_lmem)
offset = CHUNK_SZ;
+ /*
+ * While building the chain of requests, we need to ensure
+ * that no one can sneak into the timeline unnoticed.
+ */
+ mutex_lock(&ce->timeline->mutex);
+
do {
int len;
- rq = i915_request_create(ce);
+ rq = i915_request_create_locked(ce);
if (IS_ERR(rq)) {
err = PTR_ERR(rq);
- goto out_ce;
+ break;
}
if (deps) {
@@ -1056,17 +1075,25 @@ intel_context_migrate_clear(struct intel_context *ce,
/* Arbitration is re-enabled between requests. */
out_rq:
- if (*out)
+ i915_sw_fence_await(&rq->submit);
+ i915_request_get(rq);
+ i915_request_add_locked(rq);
+ if (*out) {
+ i915_sw_fence_complete(&(*out)->submit);
i915_request_put(*out);
- *out = i915_request_get(rq);
- i915_request_add(rq);
+ }
+ *out = rq;
Btw if all else fails perhaps these two blocks can be consolidated by
something like __chain_requests(rq, out) and all these operations in it.
Not sure how much would that save in the grand total.
Regards,
Tvrtko
+
if (err || !it.sg || !sg_dma_len(it.sg))
break;
cond_resched();
} while (1);
-out_ce:
+ mutex_unlock(&ce->timeline->mutex);
+
+ if (*out)
+ i915_sw_fence_complete(&(*out)->submit);
return err;
}