On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote:
On 2021-10-27 10:50 a.m., Christian König wrote:
Am 27.10.21 um 16:47 schrieb Andrey Grodzovsky:
On 2021-10-27 10:34 a.m., Christian König wrote:
Am 27.10.21 um 16:27 schrieb Andrey Grodzovsky:
[SNIP]
Please let me know if I am still missing some point of yours.
Well, I mean we need to be able to handle this for all drivers.
For sure, but as I said above, in my opinion we need to change this only
for those drivers that don't use the _locked version.
And that absolutely won't work.
See, the dma_fence is a contract between drivers, so you need the same
calling convention across all drivers.
Either we always call the callback with the lock held or we always call
it without the lock; doing it one way sometimes and the other way at
other times won't work.
Christian.
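For context, a simplified sketch of the existing signaling path (based
on drivers/dma-buf/dma-fence.c around this kernel version; tracing and
lockdep annotations trimmed). In both entry points the fence callbacks
run with fence->lock held, which is why signaling a second fence with
the same lockdep class from inside a callback trips lockdep:

int dma_fence_signal_timestamp_locked(struct dma_fence *fence,
				      ktime_t timestamp)
{
	struct dma_fence_cb *cur, *tmp;
	struct list_head cb_list;

	lockdep_assert_held(fence->lock);

	if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
				      &fence->flags)))
		return -EINVAL;

	/* Stash the cb_list before replacing it with the timestamp. */
	list_replace(&fence->cb_list, &cb_list);

	fence->timestamp = timestamp;
	set_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);

	list_for_each_entry_safe(cur, tmp, &cb_list, node) {
		INIT_LIST_HEAD(&cur->node);
		cur->func(fence, cur);	/* invoked under fence->lock */
	}

	return 0;
}

int dma_fence_signal(struct dma_fence *fence)
{
	unsigned long flags;
	int ret;

	if (!fence)
		return -EINVAL;

	spin_lock_irqsave(fence->lock, flags);
	ret = dma_fence_signal_timestamp_locked(fence, ktime_get());
	spin_unlock_irqrestore(fence->lock, flags);

	return ret;
}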
I am not sure I fully understand what problems this will cause, but
anyway, then we are back to irq_work. We cannot embed the irq_work as a
union within dma_fence's cb_list because that member is already reused
as the timestamp and as the rcu head after the fence is signaled (see
the trimmed struct below). So I will do it within drm_scheduler with a
single irq_work per drm_sched_entity, as we discussed before.
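For reference, a trimmed sketch of struct dma_fence from
include/linux/dma-fence.h around this kernel version, showing why there
is no spare member on the fence side to overlay an irq_work:

struct dma_fence {
	spinlock_t *lock;
	const struct dma_fence_ops *ops;
	/*
	 * The same storage is reused over the fence's lifetime:
	 * the callback list while the fence is pending, then the
	 * signal timestamp, then the rcu head used to free it.
	 */
	union {
		struct list_head cb_list;
		/* @cb_list replaced by @timestamp on dma_fence_signal() */
		ktime_t timestamp;
		/* @timestamp replaced by @rcu on dma_fence_release() */
		struct rcu_head rcu;
	};
	u64 context;
	u64 seqno;
	unsigned long flags;
	struct kref refcount;
	int error;
};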
That won't work either. We free up the entity after the cleanup
function. That's the reason we use the callback on the job in the
first place.
Yep, missed it.
We could overload the cb structure in the job though.
I guess so, since no one else is using this member after the cb has executed.
Andrey
Attached a patch. Please give it a try; I tested it on my side and
tried to generate the right conditions to trigger this code path by
repeatedly submitting commands while issuing a GPU reset to stop the
scheduler and then killing the command submission process in the
middle. But for some reason it looks like the job_queue was already
empty by the time the entity was killed.
Andrey
Christian.
Andrey
Andrey
From 8ba5c089939b79a6567411c33d4db40e5846eef3 Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Date: Thu, 28 Oct 2021 12:24:03 -0400
Subject: [PATCH] drm/sched: Avoid lockdep splat on killing a process
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Problem:
Signaling one sched fence from within another sched fence's signal
callback generates a lockdep splat because both have the same lockdep
class for their fence->lock.

Fix:
Fix the stack below by deferring to irq_work the signaling and freeing
of the jobs that are left when an entity is killed.
[11176.741181] dump_stack+0x10/0x12
[11176.741186] __lock_acquire.cold+0x208/0x2df
[11176.741197] lock_acquire+0xc6/0x2d0
[11176.741204] ? dma_fence_signal+0x28/0x80
[11176.741212] _raw_spin_lock_irqsave+0x4d/0x70
[11176.741219] ? dma_fence_signal+0x28/0x80
[11176.741225] dma_fence_signal+0x28/0x80
[11176.741230] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741240] drm_sched_entity_kill_jobs_cb+0x1c/0x50 [gpu_sched]
[11176.741248] dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741254] dma_fence_signal+0x3b/0x80
[11176.741260] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741268] drm_sched_job_done.isra.0+0x7f/0x1a0 [gpu_sched]
[11176.741277] drm_sched_job_done_cb+0x12/0x20 [gpu_sched]
[11176.741284] dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741290] dma_fence_signal+0x3b/0x80
[11176.741296] amdgpu_fence_process+0xd1/0x140 [amdgpu]
[11176.741504] sdma_v4_0_process_trap_irq+0x8c/0xb0 [amdgpu]
[11176.741731] amdgpu_irq_dispatch+0xce/0x250 [amdgpu]
[11176.741954] amdgpu_ih_process+0x81/0x100 [amdgpu]
[11176.742174] amdgpu_irq_handler+0x26/0xa0 [amdgpu]
[11176.742393] __handle_irq_event_percpu+0x4f/0x2c0
[11176.742402] handle_irq_event_percpu+0x33/0x80
[11176.742408] handle_irq_event+0x39/0x60
[11176.742414] handle_edge_irq+0x93/0x1d0
[11176.742419] __common_interrupt+0x50/0xe0
[11176.742426] common_interrupt+0x80/0x90
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Suggested-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
Suggested-by: Christian König <christian.koenig@xxxxxxx>
---
drivers/gpu/drm/scheduler/sched_entity.c | 15 ++++++++++++---
include/drm/gpu_scheduler.h | 12 +++++++++++-
2 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 27e1573af96e..191c56064f19 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -190,6 +190,16 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
}
EXPORT_SYMBOL(drm_sched_entity_flush);
+static void drm_sched_entity_kill_jobs_irq_work(struct irq_work *wrk)
+{
+ struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
+
+ drm_sched_fence_finished(job->s_fence);
+ WARN_ON(job->s_fence->parent);
+ job->sched->ops->free_job(job);
+}
+
+
/* Signal the scheduler finished fence when the entity in question is killed. */
static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
struct dma_fence_cb *cb)
@@ -197,9 +207,8 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
finish_cb);
- drm_sched_fence_finished(job->s_fence);
- WARN_ON(job->s_fence->parent);
- job->sched->ops->free_job(job);
+ init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
+ irq_work_queue(&job->work);
}
static struct dma_fence *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f011e4c407f2..bbc22fad8d80 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -28,6 +28,7 @@
#include <linux/dma-fence.h>
#include <linux/completion.h>
#include <linux/xarray.h>
+#include <linux/irq_work.h>
#define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
@@ -286,7 +287,16 @@ struct drm_sched_job {
struct list_head list;
struct drm_gpu_scheduler *sched;
struct drm_sched_fence *s_fence;
- struct dma_fence_cb finish_cb;
+
+ /*
+ * work is used only after finish_cb has been used and will not be
+ * accessed anymore.
+ */
+ union {
+ struct dma_fence_cb finish_cb;
+ struct irq_work work;
+ };
+
uint64_t id;
atomic_t karma;
enum drm_sched_priority s_priority;
--
2.25.1