On 6/20/23 17:16, Tatsuyuki Ishi wrote:
On 6/20/23 17:12, Christian König wrote:
On 20.06.23 at 06:07, Tatsuyuki Ishi wrote:
@@ -928,18 +874,56 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
e->user_invalidated = userpage_invalidated;
}
- r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
- &duplicates);
- if (unlikely(r != 0)) {
- if (r != -ERESTARTSYS)
- DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
- goto out_free_user_pages;
+ drm_exec_while_not_all_locked(&p->exec) {
+ r = amdgpu_vm_lock_pd(&fpriv->vm, &p->exec);
+ drm_exec_continue_on_contention(&p->exec);
Duplicate handling is needed for pretty much every call of amdgpu_vm_lock_pd, as bo->tbo.base.resv == vm->root.bo->tbo.base.resv for AMDGPU_GEM_CREATE_VM_ALWAYS_VALID.
Well no. AMDGPU_GEM_CREATE_VM_ALWAYS_VALID means that BOs should *not* be part of the relocation list. So when those cause an EALREADY here then userspace has a bug.
Sounds fair, lemme check how RADV is handling this again.
I checked again and the relocation list was actually fine, but other places were not. For example, amdgpu_gem_object_close
locks both bo->tbo.base.resv and vm->root.bo->tbo.base.resv (PD) on its own.
This was the easily debuggable case since it caused an error log, but some other BO operations on ALWAYS_VALID BOs
are presumably also broken for the same reason.
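
To make the failure mode concrete, here is a toy userspace sketch of what happens when a BO shares its reservation object with the PD. This is not the kernel code: every toy_* name is made up for illustration, and the lock is a bare owner pointer rather than a real ww_mutex. The only behaviour borrowed from the kernel is that locking a reservation object a second time with the same acquire context yields -EALREADY. Returning -EDEADLK for a lock held by someone else is a simplification.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are NOT the kernel types. */
struct toy_ctx  { int unused; };              /* plays the role of the acquire ticket */
struct toy_resv { const struct toy_ctx *owner; }; /* NULL while unlocked */
struct toy_bo   { struct toy_resv *resv; };   /* ALWAYS_VALID BOs share the PD's resv */

/* Lock one reservation object on behalf of ctx. */
static int toy_resv_lock(struct toy_resv *resv, const struct toy_ctx *ctx)
{
	if (resv->owner == ctx)
		return -EALREADY;   /* same context locked it before: a duplicate */
	if (resv->owner)
		return -EDEADLK;    /* held by someone else (simplified contention) */
	resv->owner = ctx;
	return 0;
}

static void toy_resv_unlock(struct toy_resv *resv)
{
	resv->owner = NULL;
}

/* Lock the BO, then the PD, treating every error as fatal.
 * With a shared resv the second lock fails with -EALREADY. */
static int toy_lock_bo_and_pd_strict(const struct toy_ctx *ctx,
				     struct toy_bo *bo, struct toy_bo *pd)
{
	int r = toy_resv_lock(bo->resv, ctx);

	if (r)
		return r;
	r = toy_resv_lock(pd->resv, ctx);
	if (r)
		toy_resv_unlock(bo->resv);  /* back out the first lock */
	return r;
}

/* Same sequence, but a duplicate is accepted as "already locked". */
static int toy_lock_bo_and_pd_dedup(const struct toy_ctx *ctx,
				    struct toy_bo *bo, struct toy_bo *pd)
{
	int r = toy_resv_lock(bo->resv, ctx);

	if (r)
		return r;
	r = toy_resv_lock(pd->resv, ctx);
	if (r == -EALREADY)
		return 0;                   /* shared resv: the lock is held once */
	if (r)
		toy_resv_unlock(bo->resv);
	return r;
}
```

With bo.resv and pd.resv pointing at the same toy_resv, the strict variant returns -EALREADY and leaves nothing locked, while the dedup variant returns 0 with the shared object locked exactly once. With two distinct reservation objects both variants behave identically.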
Tatsuyuki