Sorry, I confused reserve_bad_pages with add_bad_pages; you're right. If the
VRAM allocation info can be preserved across a GPU reset, your patch seems
unnecessary. If there is a risk that the info is lost, I think the pages in
[0, data->count), instead of only those in [data->last_reserved, data->count),
should be reserved.

Tao

> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: November 15, 2019 0:42
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> Koenig, Christian <Christian.Koenig@xxxxxxx>
> Cc: alexdeucher@xxxxxxxxx; Chen, Guchun <Guchun.Chen@xxxxxxx>;
> Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Subject: Re: [PATCH 1/2] drm/amdgpu/ras: Extract
> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>
>
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
> > Two questions:
> >
> > 1. "we lose all reservation during ASIC reset"
> > Are you sure of it? I remember the content of VRAM may be lost after
> > reset, but the allocation status could be preserved.
>
> Yeah, now that I am thinking of it, I think I might have confused it with BO
> content recovery in amdgpu_device_recover_vram for shadow buffers, which
> are page tables only; for just the VRAM reservation status this might be
> irrelevant... Christian - can you confirm Tao is correct on this?
>
> >
> > 2. You change the bad page handling flow from:
> >
> > detect bad page -> reserve vram for bad page -> save bad page info to
> > eeprom -> gpu reset
> >
> > to:
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time)
>
> Even if I am wrong on the first point, this is irrelevant - but still,
> why would the saved bad pages be from last time?
> See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
> - the save count is the latest, as is the content of
> data->bps[control->num_recs] up to data->bps[control->num_recs + save_count],
> since those are updated in
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages, which is
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages
> in the interrupt sequence.
>
> Andrey
>
> >
> > Is that right? Saving the bad page info (this time) to eeprom is
> > delayed until the next time a bad page is detected? But the time at
> > which a bad page is detected is random.
> > I think the bad page info would be lost without saving to eeprom if a
> > power off occurs:
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time) ->
> > poweroff/system reset (and bad page info (this time) is lost)
> >
> > Tao
> >
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> >> Andrey Grodzovsky
> >> Sent: November 14, 2019 6:45
> >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Cc: alexdeucher@xxxxxxxxx; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>;
> >> Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> >> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
> >> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> >>
> >> We want to be able to call them independently from one another, so
> >> that before GPU reset just amdgpu_ras_save_bad_pages is called.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
> >>  2 files changed, 7 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> index 4044834..600a86d 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>   * write error record array to eeprom, the function should be
> >>   * protected by recovery_lock
> >>   */
> >> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >>  {
> >>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >>  	struct ras_err_handler_data *data;
> >> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
> >>  	struct ras_err_handler_data *data;
> >>  	uint64_t bp;
> >>  	struct amdgpu_bo *bo = NULL;
> >> -	int i, ret = 0;
> >> +	int i;
> >>
> >>  	if (!con || !con->eh_data)
> >>  		return 0;
> >> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
> >>  		data->last_reserved = i + 1;
> >>  		bo = NULL;
> >>  	}
> >> -
> >> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
> >> -	ret = amdgpu_ras_save_bad_pages(adev);
> >>  out:
> >>  	mutex_unlock(&con->recovery_lock);
> >> -	return ret;
> >> +	return 0;
> >>  }
> >>
> >>  /* called when driver unload */
> >> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >>  		ret = amdgpu_ras_load_bad_pages(adev);
> >>  		if (ret)
> >>  			goto free;
> >> -		ret = amdgpu_ras_reserve_bad_pages(adev);
> >> -		if (ret)
> >> -			goto release;
> >> +		amdgpu_ras_reserve_bad_pages(adev);
> >>  	}
> >>
> >>  	return 0;
> >>
> >> -release:
> >> -	amdgpu_ras_release_bad_pages(adev);
> >>  free:
> >>  	kfree((*data)->bps);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> index f80fd34..12b0797 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> @@ -492,6 +492,8 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> >>  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>  		struct eeprom_table_record *bps, int pages);
> >>
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> >> +
> >>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
> >>
> >>  static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
> >> @@ -503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
> >>  	 * i2c may be unstable in gpu reset
> >>  	 */
> >>  	if (in_task())
> >> -		amdgpu_ras_reserve_bad_pages(adev);
> >> +		amdgpu_ras_save_bad_pages(adev);
> >>
> >>  	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> >>  		schedule_work(&ras->recovery_work);
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx