Drop this patch as a new refined one was just sent out. Regards, Evan > -----Original Message----- > From: Quan, Evan > Sent: 2019年3月15日 17:15 > To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: RE: [PATCH] drm/amdgpu: error handling for xgmi and ras > > Hw_init -> psp_hw_init -> psp_hw_start (failed) -> ras_pre_fini -> hw_fini -> > psp_hw_fini > > On psp_hw_start failure, it destroys the cmd/fence/fw buffers. But the > ras/xgmi deinit was delayed to psp_hw_fini. > > In ras_pre_fini, it calls the ras APIs to disable previously enabled ras features. > And that needs the cmd/fence buffers and trigger a NULL pointer > dereference. > > Maybe "defer the cmd/fence/fw buffers destroys to psp_hw_fini also" is a > more proper way.. > > Regards, > Evan > > -----Original Message----- > > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > > Sent: 2019年3月15日 16:40 > > To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > Subject: Re: [PATCH] drm/amdgpu: error handling for xgmi and ras > > > > Am 15.03.19 um 09:33 schrieb Evan Quan: > > > This is a quick workaround. A more complete error handling around > > > psp_hw_start is actually needed. > > > > > > Change-Id: I398efd652584e022debf237950207199a4ea78fc > > > Signed-off-by: Evan Quan <evan.quan@xxxxxxx> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 +++++++++ > > > 1 file changed, 9 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > > index 5888e24219d9..f8d712d306f1 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > > @@ -941,6 +941,11 @@ static int psp_load_fw(struct amdgpu_device > > *adev) > > > amdgpu_bo_free_kernel(&psp->fw_pri_bo, > > > &psp->fw_pri_mc_addr, &psp->fw_pri_buf); > > > failed: > > > + if (psp->ras.ras_initialized) > > > + psp->ras.ras_initialized = 0; > > > + if (psp->ras.ras_initialized) > > > + psp->ras.ras_initialized = 0; > > > > No idea what that code does, but it looks really odd. Just copy&pasted > > typo maybe? > > > > A simple "psp->ras.ras_initialized = 0;" in the error path should have > > the same effect. > > > > Christian. > > > > > + > > > kfree(psp->cmd); > > > psp->cmd = NULL; > > > return ret; > > > @@ -1061,6 +1066,10 @@ static int psp_resume(void *handle) > > > > > > failed: > > > DRM_ERROR("PSP resume failed\n"); > > > + if (psp->ras.ras_initialized) > > > + psp->ras.ras_initialized = 0; > > > + if (psp->ras.ras_initialized) > > > + psp->ras.ras_initialized = 0; > > > mutex_unlock(&adev->firmware.mutex); > > > return ret; > > > } _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx