On 16.07.20 at 19:05, Felix Kuehling wrote:
On 2020-07-16 at 2:58 a.m., Christian König wrote:
On 15.07.20 at 17:14, Felix Kuehling wrote:
On 2020-07-15 at 5:28 a.m., Christian König wrote:
[SNIP]
What could be problematic and result in an overrun is that TTM was
buggy and called put_node twice for the same memory.
So I've seen that the code needs fixing as well, but I'm not 100%
sure how you ran into your problem.
This is in the KFD eviction test, which deliberately overcommits
VRAM in order to trigger lots of evictions. It will use some GTT
space while BOs are evicted. But shouldn't it move them further out
of GTT and into SYSTEM to free up GTT space?
Yes, exactly that should happen.
But for some reason it couldn't find a candidate to evict, and the
14371 pages left are just a bit too small for the buffer.
That would be a nested eviction. A VRAM to GTT eviction requires a GTT
to SYSTEM eviction to make space in GTT. Is that even possible?
Yes, this is the core of the TTM design problem which I talked about
in my FOSDEM presentation in February.
Question: do we still have this crude workaround where KFD does not take
all reservations of the current process when allocating new BOs?
Not sure if you're referring to the workarounds we had to remove
eviction fences from reservations temporarily. Those are all gone. We're
making full use of the sync-object fence owner logic to avoid triggering
eviction fences unintentionally.
I was talking about this check here in amdgpu_ttm_bo_eviction_valuable():
	/* If bo is a KFD BO, check if the bo belongs to the current process.
	 * If true, then return false as any KFD process needs all its BOs to
	 * be resident to run successfully
	 */
	flist = dma_resv_get_list(bo->base.resv);
	if (flist) {
		for (i = 0; i < flist->shared_count; ++i) {
			f = rcu_dereference_protected(flist->shared[i],
				dma_resv_held(bo->base.resv));
			if (amdkfd_fence_check_mm(f, current->mm))
				return false;
		}
	}
What can happen is that the allocating process owns too much of GTT as
well, and as an end result we can't evict anything from GTT to allow for
VRAM eviction to happen.
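
To illustrate the failure mode, here is a minimal userspace sketch, not kernel code. It models only the ownership test from the check above; every name in it (toy_bo, toy_eviction_valuable, toy_make_gtt_space) is hypothetical, and real TTM eviction involves LRU lists, reservations and fences that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all types and functions here are hypothetical. */
struct toy_bo {
	int owner;     /* id of the process that owns this BO */
	size_t pages;  /* size of the BO in pages */
	bool in_gtt;   /* true while the BO occupies GTT space */
};

/* Models the ownership check: BOs that belong to the allocating
 * process are never considered worth evicting.
 */
static bool toy_eviction_valuable(const struct toy_bo *bo, int current_owner)
{
	return bo->owner != current_owner;
}

/* Try to bring free GTT space up to `needed` pages by moving eligible
 * BOs from GTT to SYSTEM. Returns true if enough space is available
 * afterwards.
 */
static bool toy_make_gtt_space(struct toy_bo *bos, size_t n, size_t gtt_free,
			       size_t needed, int current_owner)
{
	assert(bos != NULL || n == 0);

	for (size_t i = 0; i < n && gtt_free < needed; ++i) {
		if (!bos[i].in_gtt)
			continue;
		if (!toy_eviction_valuable(&bos[i], current_owner))
			continue; /* skipped: owned by the allocating process */
		bos[i].in_gtt = false; /* "evicted" to SYSTEM */
		gtt_free += bos[i].pages;
	}
	return gtt_free >= needed;
}
```

In this model, if every BO in GTT has the same owner as the allocating process, toy_make_gtt_space() finds no candidate and returns false, so the VRAM to GTT eviction that needed the space cannot proceed either. As soon as one sufficiently large BO belongs to another process, the same call succeeds.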
I don't know why we would need to take all reservations when we allocate
a new BO. I'm probably misunderstanding you.
Taking all reservations when you change the set of BOs allocated in a
working context is mandatory for correct operation.
I've already noted multiple times that working around this the way we
currently do is just a hack, and what you see here is one of the symptoms.
Regards,
Christian.
Regards,
Felix
That could maybe cause this as well.
Regards,
Christian.