Am 14.07.23 um 12:07 schrieb Asahi Lina:
On 14/07/2023 18.51, Christian König wrote:
Am 14.07.23 um 11:44 schrieb Asahi Lina:
On 14/07/2023 17.43, Christian König wrote:
Am 14.07.23 um 10:21 schrieb Asahi Lina:
A signaled scheduler fence can outlive its scheduler, since fences
are
independencly reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.
Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.
Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
---
drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
drivers/gpu/drm/scheduler/sched_fence.c | 4 +++-
include/drm/gpu_scheduler.h | 5 +++++
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
/*
* Fence is from the same scheduler, only need to wait
for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and
reallocated
+ * as another scheduler. This false positive case is okay,
as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.
This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.
So this function can't be called like this.
*/
fence = dma_fence_get(&s_fence->scheduled);
dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char
*drm_sched_fence_get_driver_name(struct dma_fence *fence)
static const char *drm_sched_fence_get_timeline_name(struct
dma_fence *f)
{
struct drm_sched_fence *fence = to_drm_sched_fence(f);
- return (const char *)fence->sched->name;
+ return (const char *)fence->sched_name;
}
static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
*fence,
unsigned seq;
fence->sched = entity->rq->sched;
+ strlcpy(fence->sched_name, entity->rq->sched->name,
+ sizeof(fence->sched_name));
seq = atomic_inc_return(&entity->fence_seq);
dma_fence_init(&fence->scheduled,
&drm_sched_fence_ops_scheduled,
&fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h
b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
* @lock: the lock used by the scheduled and the finished
fences.
*/
spinlock_t lock;
+ /**
+ * @sched_name: the name of the scheduler that owns this
fence. We
+ * keep a copy here since fences can outlive their scheduler.
+ */
+ char sched_name[16];
This just mitigates the problem, but doesn't fix it.
Could you point out any remaining issues so we can fix them? Right now
this absolutely *is* broken and this fixes the breakage I observed. If
there are other bugs remaining, I'd like to know what they are so I
can fix them.
The real issue is that the hw fence is kept around much longer than
that.
As far as I know, the whole point of scheduler fences is to decouple
the hw fences from the consumers.
Well yes and no. The decoupling is for the signaling, it's not
decoupling the lifetime.
When I spoke with Daniel I understood the intent was also to decouple
the lifetime.
Yes, we discussed that a long long time ago as well.
We *wanted* to decouple the dma_fence lifetime, it's just not done that
way because of problems with that approach.
I already talked with Daniel about this. The current behavior is
broken. These fences can live forever. It is broken to require that
they outlive the driver that produced them.
Additional to that I'm not willing to increase the scheduler fence
size
like that just to decouple them from the scheduler.
Did you read my explanation on the cover letter as to how this is just
broken right now? We need to fix this. If you have a better suggestion
I'll take it. Doing nothing is not an option.
Well this isn't broken at all. This works exactly like intended, you
just want to use it for something it wasn't made for.
That scheduler fences could be changed to outlive the scheduler which
issued them is possible, but this is certainly a new requirement.
Especially since we need to grab additional references to make sure that
the module isn't unloaded in such a case.
Yes, that's a remaining issue. The fences need to grab a module
reference to make sure drm_sched doesn't get unloaded until they're
all really gone. I can add that in v2.
You also need to come up with an idea to prevent races with the deadline
handling. See drm_sched_fence_set_deadline_finished().
It would also be desirable to drop the hw fence as soon as it signals,
instead of keeping a reference to it forever.
Yeah, agree. Problem here again is that this is easier said than done in
a non-racy way.
Christian.
~~ Lina