Please see my comment about the patch organization > -----Original Message----- > From: Chen, Guchun <Guchun.Chen@xxxxxxx> > Sent: Thursday, July 23, 2020 11:39 AM > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; Yang, Stanley > <Stanley.Yang@xxxxxxx>; Clements, John <John.Clements@xxxxxxx> > Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during > bootup/reset > > [AMD Public Use] > > Thanks for your review, Tao. > Please check my comments after yours. > > Regards, > Guchun > > -----Original Message----- > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > Sent: Thursday, July 23, 2020 10:51 AM > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; Yang, Stanley > <Stanley.Yang@xxxxxxx>; Clements, John <John.Clements@xxxxxxx> > Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during > bootup/reset > > [AMD Official Use Only - Internal Distribution Only] > > > -----Original Message----- > > From: Chen, Guchun <Guchun.Chen@xxxxxxx> > > Sent: Wednesday, July 22, 2020 11:14 AM > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > > <Alexander.Deucher@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; > > Li, Dennis <Dennis.Li@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; > > Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Clements, John > <John.Clements@xxxxxxx> > > Cc: Chen, Guchun <Guchun.Chen@xxxxxxx> > > Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during > > bootup/reset > > > > During boot up, when init eeprom, RAS will check if the BAD GPU tag is > > saved or not. And will break boot up and notify user to RMA it. > > > > At the meanwhile, when saved bad page count to eeprom exceeds > > threshold, one ras recovery will be triggered immediately, and bad gpu > > check will be executed and reset will fail as well to remind user. > > > > User could set bad_page_threshold = 0 as module parameter when probing > > driver to disable the bad feature check. > > > > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++++- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 25 +++++- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++- > > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 87 > ++++++++++++++++++- > > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 6 +- > > 5 files changed, 142 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 2662cd7c8685..d529bf7917f5 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct > > amdgpu_device *adev) > > * Note: theoretically, this should be called before all vram allocations > > * to protect retired page from abusing > > */ > > [Tao] The comment above should be also updated [Guchun]yeah, will update it > later. > > > -amdgpu_ras_recovery_init(adev); > > +r = amdgpu_ras_recovery_init(adev); > > +if (r) > > +goto init_failed; > > [Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init > should not block gpu init. > [Guchun]Good point. This should be addressed carefully. > > > > if (adev->gmc.xgmi.num_physical_nodes > 1) > > amdgpu_xgmi_add_device(adev); @@ -4133,8 +4135,20 @@ static int > > amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, > > > > amdgpu_fbdev_set_suspend(tmp_adev, 0); > > > > -/* must succeed. */ > > -amdgpu_ras_resume(tmp_adev); > > +/* > > + * The CPU is BAD once faulty pages by ECC has > > + * reached the threshold, and ras recovery is > > + * scheduled. So add one check here to prevent > > + * recovery if it's one BAD GPU, and remind > > + * user to RMA this GPU. > > + */ > > +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) { > > +/* must succeed. */ > > +amdgpu_ras_resume(tmp_adev); > > +} else { > > +r = -EINVAL; > > +goto out; > > +} > > > > /* Update PSP FW topology after reset */ if (hive && tmp_adev- > > >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int > > amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, } } > > > > - > > out: > > if (!r) { > > amdgpu_irq_gpu_reset_resume_helper(tmp_adev); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > index e3d67d85c55f..818d4154e4a3 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > @@ -62,8 +62,6 @@ const char *ras_block_string[] = { #define > > ras_err_str(i) > > (ras_error_string[ffs(i)]) #define ras_block_str(i) > > (ras_block_string[i]) > > > > -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS1 -#define > > AMDGPU_RAS_FLAG_INIT_NEED_RESET2 #define RAS_DEFAULT_FLAGS > > (AMDGPU_RAS_FLAG_INIT_BY_VBIOS) > > > > /* inject address is 52 bits */ > > @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct > > amdgpu_device > > *adev) > > struct amdgpu_ras *con = amdgpu_ras_get_context(adev); struct > > ras_err_handler_data **data; uint32_t max_eeprom_records_len = 0; > > +bool bad_gpu = false; > > int ret; > > > > if (con) > > @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct > > amdgpu_device > > *adev) > > max_eeprom_records_len = > > amdgpu_ras_eeprom_get_record_max_length(); > > amdgpu_ras_validate_threshold(adev, max_eeprom_records_len); > > > > -ret = amdgpu_ras_eeprom_init(&con->eeprom_control); > > -if (ret) > > +ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu); > > +/* > > + * we only fail this calling and halt booting when bad_gpu is true. > > + */ > > +if (bad_gpu) { > > +ret = -EINVAL; > > goto free; > > +} > > > > if (con->eeprom_control.num_recs) { > > ret = amdgpu_ras_load_bad_pages(adev); @@ -2189,3 > > +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct > > amdgpu_device *adev) > > > > return false; > > } > > + > > +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev) { struct > > +amdgpu_ras *con = amdgpu_ras_get_context(adev); bool bad_gpu = false; > > + > > +if (con && (con->bad_page_cnt_threshold != 0xFFFFFFFF)) > > +amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control, > > +&bad_gpu); > > + > > +/* We are only interested in variable bad_gpu. */ return bad_gpu; } > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > index 4672649a9293..d7a363b37feb 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > > @@ -31,6 +31,10 @@ > > #include "ta_ras_if.h" > > #include "amdgpu_ras_eeprom.h" > > > > +#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS(0x1 << 0) #define > > +AMDGPU_RAS_FLAG_INIT_NEED_RESET(0x1 << 1) #define > > +AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV(0x1 << 2) > > + > > enum amdgpu_ras_block { > > AMDGPU_RAS_BLOCK__UMC = 0, > > AMDGPU_RAS_BLOCK__SDMA, > > @@ -493,6 +497,8 @@ void amdgpu_ras_suspend(struct amdgpu_device > > *adev); unsigned long amdgpu_ras_query_error_count(struct > > amdgpu_device *adev, bool is_ce); > > > > +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev); > > + > > /* error handling functions */ > > int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev, struct > > eeprom_table_record *bps, int pages); @@ -503,10 > > +509,14 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device > > *adev) { > > struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); > > > > -/* save bad page to eeprom before gpu reset, > > - * i2c may be unstable in gpu reset > > +/* > > + * Save bad page to eeprom before gpu reset, i2c may be unstable > > + * in gpu reset. > > + * > > + * Also, exclude the case when ras recovery issuer is > > + * eerprom page write. > > */ > > -if (in_task()) > > +if (!(ras->flags & AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV) && > > in_task()) > > amdgpu_ras_reserve_bad_pages(adev); > > > > if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) diff --git > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > index a2c982b1eac6..96b63c026bad 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > [Tao] It's better to split it into two patches, one is for ras and another one is for > eeprom [Guchun]I tried this, however, even if for ras change, eeprom change is > needed as well. So I merged them both into one patch. [Tao] We may need to split it into more parts like this: Part1: preparation for ras Part2: preparation for eeprom Part3: main part for ras Part4: main part for eeprom > > > @@ -46,6 +46,9 @@ > > #define EEPROM_TABLE_HDR_VAL 0x414d4452 #define > EEPROM_TABLE_VER > > 0x00010000 > > > > +/* Bad GPU tag ‘BADG’ */ > > +#define EEPROM_TABLE_HDR_BAD 0x42414447 > > + > > /* Assume 2 Mbit size */ > > #define EEPROM_SIZE_BYTES 256000 > > #define EEPROM_PAGE__SIZE_BYTES 256 > > @@ -238,12 +241,14 @@ int amdgpu_ras_eeprom_reset_table(struct > > amdgpu_ras_eeprom_control *control) > > > > } > > > > -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control) > > +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control, > > +bool *bad_gpu) > > { > > int ret = 0; > > struct amdgpu_device *adev = to_amdgpu_device(control); unsigned > > char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 }; > > struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr; > > +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); > > struct i2c_msg msg = { > > .addr= 0, > > .flags= I2C_M_RD, > > @@ -251,6 +256,8 @@ int amdgpu_ras_eeprom_init(struct > > amdgpu_ras_eeprom_control *control) .buf= buff, }; > > > > +*bad_gpu = false; > > + > > /* Verify i2c adapter is initialized */ if (!adev->pm.smu_i2c.algo) > > return -ENOENT; @@ -279,6 +286,11 @@ int amdgpu_ras_eeprom_init(struct > > amdgpu_ras_eeprom_control *control) DRM_DEBUG_DRIVER("Found > existing > > EEPROM table with %d records", > > control->num_recs); > > > > +} else if ((ras->bad_page_cnt_threshold != 0xFFFFFFFF) && ( > > +hdr->header == EEPROM_TABLE_HDR_BAD)) { > > +*bad_gpu = true; > > +DRM_ERROR("Detect BAD GPU TAG in eeprom table and " > > +"GPU shall be RMAed.\n"); > > } else { > > DRM_INFO("Creating new EEPROM table"); > > > > @@ -375,6 +387,44 @@ static uint32_t > > __correct_eeprom_dest_address(uint32_t curr_address) return > > curr_address; } > > > > +int amdgpu_ras_eeprom_check_bad_gpu(struct > amdgpu_ras_eeprom_control > > *control, > > +bool *bad_gpu) > > +{ > > +struct amdgpu_device *adev = to_amdgpu_device(control); unsigned char > > +buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 }; struct > > +amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr; struct > > +i2c_msg msg = { .addr = control->i2c_address, .flags = I2C_M_RD, .len > > += EEPROM_ADDRESS_SIZE + > > EEPROM_TABLE_HEADER_SIZE, > > +.buf = buff, > > +}; > > +int ret; > > + > > +*bad_gpu = false; > > + > > +/* read EEPROM table header */ > > +mutex_lock(&control->tbl_mutex); > > +ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1); if (ret < 1) { > > +dev_err(adev->dev, "Failed to read EEPROM table header.\n"); goto > > +err; > > [Tao] One tab is missed > [Guchun]Correct this later. > > > +} > > + > > +__decode_table_header_from_buff(hdr, &buff[2]); > > + > > +if (hdr->header == EEPROM_TABLE_HDR_BAD) { dev_warn(adev->dev, > > +"Current GPU is BAD and should be > > RMAed.\n"); > > +*bad_gpu = true; > > +} > > + > > +ret = 0; > > +err: > > +mutex_unlock(&control->tbl_mutex); > > +return ret; > > +} > > + > > int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control > > *control, > > struct eeprom_table_record > > *records, > > bool write, > > @@ -383,8 +433,10 @@ int amdgpu_ras_eeprom_process_recods(struct > > amdgpu_ras_eeprom_control *control, > > int i, ret = 0; > > struct i2c_msg *msgs, *msg; > > unsigned char *buffs, *buff; > > +bool sched_ras_recovery = false; > > struct eeprom_table_record *record; > > struct amdgpu_device *adev = to_amdgpu_device(control); > > +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); > > > > if (adev->asic_type != CHIP_VEGA20 && adev->asic_type != > > CHIP_ARCTURUS) > > return 0; > > @@ -402,6 +454,25 @@ int amdgpu_ras_eeprom_process_recods(struct > > amdgpu_ras_eeprom_control *control, > > goto free_buff; > > } > > > > +/* > > + * If saved bad pages number exceeds the bad page threshod for > > + * the whole VRAM, update table header to mark one BAD GPU and > > + * schedule one ras recovery after eeprom write is done, this > > + * can avoid the missing for latest records. > > + * > > + * This new header will be picked up and checked in the bootup by > > + * ras recovery, which may break bootup process to notify user this > > + * GPU is bad and to RMA(Return Merchandise Authorization) this GPU. > > + */ > > +if (write && (ras->bad_page_cnt_threshold != 0xFFFFFFFF) && > > +((control->num_recs + num) >= ras->bad_page_cnt_threshold)) { > > +dev_warn(adev->dev, "Saved bad pages(%d) reaches threshold > > +value(%d).\n", > > +control->num_recs + num, ras- > > >bad_page_cnt_threshold); > > +control->tbl_hdr.header = EEPROM_TABLE_HDR_BAD; > > +sched_ras_recovery = true; > > +} > > + > > /* In case of overflow just start from beginning to not lose newest > > records */ if (write && (control->next_addr + > > EEPROM_TABLE_RECORD_SIZE * num > EEPROM_SIZE_BYTES)) > > control->next_addr = EEPROM_RECORD_START; @@ -482,6 > > +553,20 @@ int amdgpu_ras_eeprom_process_recods(struct > > amdgpu_ras_eeprom_control *control, > > __update_tbl_checksum(control, records, num, old_hdr_byte_sum); > > > > __update_table_header(control, buffs); > > + > > +if (sched_ras_recovery) { > > +/* > > + * Before scheduling ras recovery, assert the related > > + * flag first, which shall bypass common bad page > > + * reservation execution in amdgpu_ras_reset_gpu. > > + */ > > +amdgpu_ras_get_context(adev)->flags |= > > +AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV; > > + > > +dev_warn(adev->dev, "Conduct ras recovery due to bad > > " > > +"page threshold reached.\n"); > > +amdgpu_ras_reset_gpu(adev); > > +} > > } else if (!__validate_tbl_checksum(control, records, num)) { > > DRM_WARN("EEPROM Table checksum mismatch!"); > > /* TODO Uncomment when EEPROM read/write is relliable */ diff --git > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > index b272840cb069..4ccd139248a9 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > @@ -77,9 +77,13 @@ struct eeprom_table_record { unsigned char > > mcumc_id; }__attribute__((__packed__)); > > > > -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control > > *control); > > +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control, > > +bool *bad_gpu); > > int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control > > *control); > > > > +int amdgpu_ras_eeprom_check_bad_gpu(struct > amdgpu_ras_eeprom_control > > *control, > > +bool *bad_gpu); > > + > > int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control > > *control, > > struct eeprom_table_record > > *records, > > bool write, > > -- > > 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx