On Tue, Jun 6, 2017 at 10:52 AM, Huang Rui <ray.huang at amd.com> wrote: > On Tue, Jun 06, 2017 at 10:45:42PM +0800, Alex Deucher wrote: >> On Tue, Jun 6, 2017 at 10:03 AM, Alex Deucher <alexdeucher at gmail.com> wrote: >> > On Tue, Jun 6, 2017 at 7:22 AM, Christian König >> > <christian.koenig at amd.com> wrote: >> >>> Yes, I agree with you. That's also my orignal opinion. >> >>> But we encountered a random buggy when we were calling >> >>> device_cache_fw_images. >> >> >> >> That looks like an upstream bug in device_cache_fw_images. >> >> >> >> We should probably open a bug report and ping the maintainer. Most likely we >> >> are not correctly using the FW interface or trigger a rare bug or something >> >> like this. >> >> >> >>> So then I check these functions and find gpu_info errors. The random buggy >> >>> cannot be reproduced constantly.But we expected it can pass more than 30 >> >>> cycles >> >>> of S3 suspend and resume. Any ideas? >> >> >> >> I think the real solution is to just stop calling >> >> amdgpu_device_parse_gpu_info_fw() during resume. >> > >> > Right. we only need to parse the firmware once during startup. >> >> How are hitting this on resume? amdgpu_device_parse_gpu_info_fw() is >> called indirectly from amdgpu_device_init() which is only called once >> at driver load time. >> > > Yes, I also noted it. So I am confused with why firmware_class will still > cache it during suspend. I guess request_firmware expects the driver to keep the firmware around until the driver is unloaded. Just one comment about the patch: @@ -1272,6 +1272,9 @@ struct amdgpu_firmware { const struct amdgpu_psp_funcs *funcs; struct amdgpu_bo *rbuf; struct mutex mutex; + + /* gpu info firmware data pointer */ + const struct firmware *fw; }; Call this gpu_info_fw rather than just fw. With that, the patch is: Reviewed-by: Alex Deucher <alexander.deucher at amd.com> Alex