[AMD Official Use Only - AMD Internal Distribution Only]

OK

-----------------
Best Regards,
Thomas

-----Original Message-----
From: Yang, Stanley <Stanley.Yang@xxxxxxx>
Sent: Monday, July 1, 2024 2:41 PM
To: Chai, Thomas <YiPeng.Chai@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
Subject: RE: [PATCH V2] drm/amdgpu: sysfs node disable query error count during gpu reset

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Thomas,

I think we can optimize where amdgpu_ras_set_error_query_ready(adev, true) is called during GPU recovery:

amdgpu_ras_set_error_query_ready(tmp_adev, false) -> recovery start -> recovery done -> amdgpu_ras_set_error_query_ready(tmp_adev, true)

This sequence avoids querying the error count during GPU recovery.

Regards,
Stanley

> -----Original Message-----
> From: Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Sent: Monday, July 1, 2024 11:19 AM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Subject: [PATCH V2] drm/amdgpu: sysfs node disable query error count during gpu reset
>
> Sysfs node disable query error count during gpu reset.
>
> Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ac7ded01dad0..a65b5197b0fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -619,6 +619,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>  static ssize_t amdgpu_ras_sysfs_read(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
> +	int ret;
>  	struct ras_manager *obj = container_of(attr, struct ras_manager, sysfs_attr);
>  	struct ras_query_if info = {
>  		.head = obj->head,
> @@ -627,7 +628,10 @@ static ssize_t amdgpu_ras_sysfs_read(struct device *dev,
>  	if (!amdgpu_ras_get_error_query_ready(obj->adev))
>  		return sysfs_emit(buf, "Query currently inaccessible\n");
>
> -	if (amdgpu_ras_query_error_status(obj->adev, &info))
> +	ret = amdgpu_ras_query_error_status(obj->adev, &info);
> +	if (ret == -EIO) /* gpu reset is ongoing */
> +		return sysfs_emit(buf, "Query currently inaccessible\n");
> +	else if (ret)
>  		return -EINVAL;
>
>  	if (amdgpu_ip_version(obj->adev, MP0_HWIP, 0) != IP_VERSION(11, 0, 2) &&
> @@ -1290,12 +1294,19 @@ static int amdgpu_aca_log_ras_error_data(struct amdgpu_device *adev, enum amdgpu
>  ssize_t amdgpu_ras_aca_sysfs_read(struct device *dev, struct device_attribute *attr,
>  				  struct aca_handle *handle, char *buf, void *data)
>  {
> +	int ret;
>  	struct ras_manager *obj = container_of(handle, struct ras_manager, aca_handle);
>  	struct ras_query_if info = {
>  		.head = obj->head,
>  	};
>
> -	if (amdgpu_ras_query_error_status(obj->adev, &info))
> +	if (!amdgpu_ras_get_error_query_ready(obj->adev))
> +		return sysfs_emit(buf, "Query currently inaccessible\n");
> +
> +	ret = amdgpu_ras_query_error_status(obj->adev, &info);
> +	if (ret == -EIO) /* gpu reset is ongoing */
> +		return sysfs_emit(buf, "Query currently inaccessible\n");
> +	else if (ret)
>  		return -EINVAL;
>
>  	return sysfs_emit(buf, "%s: %lu\n%s: %lu\n%s: %lu\n", "ue", info.ue_count,
> --
> 2.34.1
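
For illustration only, the ordering Stanley suggests (clear the ready flag, run recovery, set the flag again) can be sketched as a small userspace mock. This is not driver code: struct mock_adev and every mock_* function are hypothetical stand-ins for struct amdgpu_device and the amdgpu_ras_*_error_query_ready() helpers, and the "ue"/"ce" output format is simplified.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct amdgpu_device; only the flag matters here. */
struct mock_adev {
	bool error_query_ready;
};

/* Hypothetical analogue of amdgpu_ras_set_error_query_ready(). */
static void mock_set_error_query_ready(struct mock_adev *adev, bool ready)
{
	adev->error_query_ready = ready;
}

/* Hypothetical analogue of amdgpu_ras_get_error_query_ready(). */
static bool mock_get_error_query_ready(const struct mock_adev *adev)
{
	return adev->error_query_ready;
}

/* Same shape as the sysfs read path in the patch: bail out early if not ready. */
static int mock_sysfs_read(const struct mock_adev *adev, char *buf, size_t len)
{
	if (!mock_get_error_query_ready(adev))
		return snprintf(buf, len, "Query currently inaccessible\n");

	/* Simplified placeholder counts; the real path queries the hardware. */
	return snprintf(buf, len, "ue: %lu\nce: %lu\n", 0UL, 0UL);
}

/*
 * Suggested ordering: flag false -> recovery start -> recovery done -> flag true.
 * 'during' captures what a sysfs read issued mid-recovery would return.
 */
static void mock_gpu_recover(struct mock_adev *adev, char *during, size_t len)
{
	mock_set_error_query_ready(adev, false);

	/* recovery start: any read in this window sees the cleared flag */
	mock_sysfs_read(adev, during, len);
	/* recovery done */

	mock_set_error_query_ready(adev, true);
}
```

In this sketch a read issued inside mock_gpu_recover() gets the "Query currently inaccessible" message, and reads before and after recovery return counts as usual. Whether the flag alone is sufficient in the real driver, or the -EIO check in the patch is still needed for reads that race with the flag update, is a question for the thread rather than something this mock settles.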