On 16.03.25 at 16:36, Sathishkumar S wrote:
> Wait for vm page table updates to finish before resuming user queues.
> Resume must follow after completion of pte updates to avoid page faults.
>
> amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:10 pasid:32771)
> amdgpu: in process pid 0 thread pid 0)
> amdgpu: in page starting at address 0x0000800105405000 from client 10
> amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00A41051
> amdgpu: Faulty UTCL2 client ID: TCP (0x8)
> amdgpu: MORE_FAULTS: 0x1
> amdgpu: WALKER_ERROR: 0x0
> amdgpu: PERMISSION_FAULTS: 0x5
> amdgpu: MAPPING_ERROR: 0x0
> amdgpu: RW: 0x1
> amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:10 pasid:32771)
> amdgpu: in process pid 0 thread pid 0)
> amdgpu: in page starting at address 0x0000800105404000 from client 10
>
> Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> index f1d4e29772a5..4c3edd988a05 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> @@ -541,10 +541,23 @@ amdgpu_userqueue_validate_bos(struct amdgpu_userq_mgr *uq_mgr)
>  static void amdgpu_userqueue_resume_worker(struct work_struct *work)
>  {
>  	struct amdgpu_userq_mgr *uq_mgr = work_to_uq_mgr(work, resume_work.work);
> +	struct amdgpu_fpriv *fpriv = uq_mgr_to_fpriv(uq_mgr);
> +	struct amdgpu_eviction_fence_mgr *evf_mgr = &fpriv->evf_mgr;
> +	struct amdgpu_eviction_fence *ev_fence;
>  	int ret;
>
>  	mutex_lock(&uq_mgr->userq_mutex);
>
> +	spin_lock(&evf_mgr->ev_fence_lock);
> +	ev_fence = evf_mgr->ev_fence;
> +	spin_unlock(&evf_mgr->ev_fence_lock);
> +	if (ev_fence && dma_fence_is_signaled(&ev_fence->base)) {
> +		/* Wait for the prior vm updates to complete before proceeding with resume */
> +		dma_resv_wait_timeout(fpriv->vm.root.bo->tbo.base.resv,
> +				      DMA_RESV_USAGE_BOOKKEEP,
> +				      true,
> +				      msecs_to_jiffies(AMDGPU_FENCE_JIFFIES_TIMEOUT));
> +	}

In general I agree that we need to wait for PTE updates before resuming
userqueues, but this here is just nonsense.

>  	ret = amdgpu_userqueue_validate_bos(uq_mgr);

This call here is validating the BOs, updating the PTEs *and* making
sure that we wait for this update to finish.

Waiting before that just doesn't make any sense as far as I can see.

Regards,
Christian.

>  	if (ret) {
>  		DRM_ERROR("Failed to validate BOs to restore\n");