Sorry, I confused reserve_bad_pages with add_bad_pages; you're right. If the
VRAM allocation info can be preserved across a GPU reset, your patch seems
unnecessary. If there is a risk that the info is lost, I think the pages in
[0, data->count), instead of only those in [data->last_reserved, data->count),
should be reserved.

Tao

> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: November 15, 2019 0:42
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> Koenig, Christian <Christian.Koenig@xxxxxxx>
> Cc: alexdeucher@xxxxxxxxx; Chen, Guchun <Guchun.Chen@xxxxxxx>;
> Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Subject: Re: [PATCH 1/2] drm/amdgpu/ras: Extract
> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>
>
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
> > Two questions:
> >
> > 1. "we lose all reservation during ASIC reset"
> > Are you sure of it? I remember the content of VRAM may be lost after
> > reset, but the allocation status could be preserved.
>
> Yeah, now that I am thinking of it, I think I might have confused it with BO
> content recovery in amdgpu_device_recover_vram for shadow buffers, which
> are page tables only; for just the VRAM reservation status this might be
> irrelevant... Christian - can you confirm Tao is correct on this?
>
> >
> > 2. You change the bad page handling flow from:
> >
> > detect bad page -> reserve vram for bad page -> save bad page info to
> > eeprom -> gpu reset
> >
> > to:
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time)
>
> Even if I am wrong on the first point, this is irrelevant - but still,
> why would the saved bad pages be from last time?
> See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
> - the save count is the latest, as is the content of
> data->bps[control->num_recs] up to data->bps[control->num_recs + save_count],
> since those are updated in
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages, which is
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages
> in the interrupt sequence.
>
> Andrey
>
> >
> > Is that right? Saving the bad page info (this time) to eeprom is
> > delayed until the next time a bad page is detected? But the time at
> > which a bad page is detected is random.
> > I think the bad page info would be lost without saving to eeprom if a
> > power off occurs:
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time) ->
> > poweroff/system reset (and bad page info (this time) is lost)
> >
> > Tao
> >
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> >> Andrey Grodzovsky
> >> Sent: November 14, 2019 6:45
> >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Cc: alexdeucher@xxxxxxxxx; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>;
> >> Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> >> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
> >> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> >>
> >> We want to be able to call them independently from one another, so
> >> that before GPU reset just amdgpu_ras_save_bad_pages is called.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
> >>  2 files changed, 7 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> index 4044834..600a86d 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>   * write error record array to eeprom, the function should be
> >>   * protected by recovery_lock
> >>   */
> >> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >>  {
> >>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >>  	struct ras_err_handler_data *data;
> >> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
> >>  	struct ras_err_handler_data *data;
> >>  	uint64_t bp;
> >>  	struct amdgpu_bo *bo = NULL;
> >> -	int i, ret = 0;
> >> +	int i;
> >>
> >>  	if (!con || !con->eh_data)
> >>  		return 0;
> >> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
> >>  		data->last_reserved = i + 1;
> >>  		bo = NULL;
> >>  	}
> >> -
> >> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
> >> -	ret = amdgpu_ras_save_bad_pages(adev);
> >>  out:
> >>  	mutex_unlock(&con->recovery_lock);
> >> -	return ret;
> >> +	return 0;
> >>  }
> >>
> >>  /* called when driver unload */
> >> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >>  		ret = amdgpu_ras_load_bad_pages(adev);
> >>  		if (ret)
> >>  			goto free;
> >> -		ret = amdgpu_ras_reserve_bad_pages(adev);
> >> -		if (ret)
> >> -			goto release;
> >> +		amdgpu_ras_reserve_bad_pages(adev);
> >>  	}
> >>
> >>  	return 0;
> >>
> >> -release:
> >> -	amdgpu_ras_release_bad_pages(adev);
> >>  free:
> >>  	kfree((*data)->bps);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> index f80fd34..12b0797 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> @@ -492,6 +492,8 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> >>  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>  		struct eeprom_table_record *bps, int pages);
> >>
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> >> +
> >>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
> >>
> >>  static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
> >> @@ -503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
> >>  	 * i2c may be unstable in gpu reset
> >>  	 */
> >>  	if (in_task())
> >> -		amdgpu_ras_reserve_bad_pages(adev);
> >> +		amdgpu_ras_save_bad_pages(adev);
> >>
> >>  	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> >>  		schedule_work(&ras->recovery_work);
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx