Hi Christian: So the allocation information of a bo's vram page could be reserved after vram lost? In fact, this series could be dropped if the result of amdgpu_bo_create_reserved is guaranteed to be failed after vram lost (protect bad pages from reallocating). Hi Andrey: A bad page's allocation information will not be lost in gpu reset according to Christian's comments. Do you have any other concern? Regards, Tao > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > Sent: 2019年9月30日 16:35 > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun > <Guchun.Chen@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx> > Subject: Re: [PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get > lost in gpu reset > > Am 30.09.19 um 05:19 schrieb Zhou1, Tao: > > the info of retired page's bo may be lost if vram lost is encountered > > in gpu reset (gpu page table in vram may be lost), force to recreate > > all bos > > > > v2: simplify NULL pointer check > > add more comments > > Repeating on the v2 of the patch set, this is complete nonsense. > > When VRAM is lost the BOs and their reservation still exits, only the content > is lost because the MC is has been resetted. > > Regards, > Christian. > > > > > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> > > Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 48 > ++++++++++++++++++++++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + > > 3 files changed, 50 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 62955156653c..a89f6d053b3f 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -3675,6 +3675,7 @@ static int amdgpu_do_asic_reset(struct > amdgpu_hive_info *hive, > > if (vram_lost) { > > DRM_INFO("VRAM is lost due to GPU > reset!\n"); > > amdgpu_inc_vram_lost(tmp_adev); > > + > amdgpu_ras_recovery_reset(tmp_adev); > > } > > > > r = amdgpu_gtt_mgr_recover( > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > index 486568ded6d6..3f2a2f13e4c6 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > @@ -1573,6 +1573,54 @@ static int amdgpu_ras_recovery_fini(struct > > amdgpu_device *adev) > > > > return 0; > > } > > + > > +/* > > + * the info of retired page's bo may be lost if vram lost is > > +encountered > > + * in gpu reset (gpu page table in vram may be lost), force to > > +recreate > > + * all bos > > + */ > > +void amdgpu_ras_recovery_reset(struct amdgpu_device *adev) { > > + struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > > + struct ras_err_handler_data *data; > > + uint64_t bp; > > + struct amdgpu_bo *bo = NULL; > > + int i; > > + > > + if (!con || !con->eh_data) > > + return; > > + > > + /* Note: It's called in gpu reset, recovery_lock may be acquired > before > > + * gpu reset. The following code is out of protect of con- > >recovery_lock > > + * in case of deadlock and bps_bo should be recreated (if successfully) > > + * whether recovery_lock is locked or not. > > + * We assume there is no other ras recovery operation simultaneous > during > > + * gpu reset. > > + */ > > + data = con->eh_data; > > + /* force to reserve all retired page again */ > > + data->last_reserved = 0; > > + > > + for (i = data->last_reserved; i < data->count; i++) { > > + bp = data->bps[i].retired_page; > > + > > + /* There are three cases of reserve error should be ignored: > > + * 1) a ras bad page has been allocated (used by someone); > > + * 2) a ras bad page has been reserved (duplicate error > injection > > + * for one page); > > + * 3) bo info isn't lost in gpu reset > > + */ > > + if (amdgpu_bo_create_kernel_at(adev, bp << > AMDGPU_GPU_PAGE_SHIFT, > > + AMDGPU_GPU_PAGE_SIZE, > > + AMDGPU_GEM_DOMAIN_VRAM, > > + &bo, NULL)) > > + DRM_NOTE("RAS NOTE: recreate bo for retired page > 0x%llx fail\n", bp); > > + else > > + data->bps_bo[i] = bo; > > + data->last_reserved = i + 1; > > + bo = NULL; > > + } > > +} > > /* recovery end */ > > > > /* return 0 if ras will reset gpu and repost.*/ diff --git > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > index f80fd3428c98..7a606d3be806 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > @@ -479,6 +479,7 @@ static inline int amdgpu_ras_is_supported(struct > amdgpu_device *adev, > > } > > > > int amdgpu_ras_recovery_init(struct amdgpu_device *adev); > > +void amdgpu_ras_recovery_reset(struct amdgpu_device *adev); > > int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev, > > unsigned int block); > > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx