On Tue, Jun 06, 2017 at 04:00:29PM +0800, Christian König wrote:
> Hi Ray,
>
> mhm, indeed a nice catch.
>
> But why do we need to load the gpu info after resume in the first place?
>
> I mean we already know what GPU we have, loading it again looks
> superfluous to me.
>

Yes, I agree with you. That was also my original opinion. But we
encountered a random bug when calling device_cache_fw_images.

[ 558.288976] cache_firmware: amdgpu/vega10_sdma1.bin
[ 558.288976] cache_firmware: amdgpu/vega10_sdma.bin ret=0
[ 558.288981] fw_set_page_data: fw-amdgpu/vega10_sdma1.bin buf=ffff8803f1e64a80 data=ffffc90002411000 size=17408
[ 558.288981] cache_firmware: amdgpu/vega10_sdma1.bin ret=0
[ 558.288997] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 558.289001] IP: devres_for_each_res+0x5e/0x100
[ 558.289001] PGD 0
[ 558.289002] Oops: 0000 [#3] SMP
[ 558.289003] Modules linked in: joydev hid_generic usbhid amdgpu(OE) ttm(OE) drm_kms_helper(OE) drm(OE) i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 nfsv4 nfs fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core intel_rapl snd_hwdep x86_pkg_temp_thermal intel_powerclamp snd_pcm kvm_intel snd_seq_midi kvm snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer irqbypass snd crct10dif_pclmul soundcore crc32_pclmul ghash_clmulni_intel pcbc mei_me aesni_intel shpchp mei aes_x86_64 crypto_simd glue_helper mac_hid cryptd acpi_pad tpm_infineon nfsd auth_rpcgss nfs_acl coretemp lockd grace sunrpc parport_pc ppdev lp parport autofs4 e1000e ptp nvme mxm_wmi ahci i2c_hid pps_core libahci nvme_core wmi video hid
[ 558.289027] CPU: 0 PID: 3742 Comm: pm-suspend Tainted: G D OE 4.11.0-custom #7
[ 558.289027] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[ 558.289027] task: ffff8803ebdcd940 task.stack: ffffc900029b0000
[ 558.289029] RIP: 0010:devres_for_each_res+0x5e/0x100
[ 558.289029] RSP: 0018:ffffc900029b3bc8 EFLAGS: 00010086
[ 558.289030] RAX: 000000000000001d RBX: ffff880426aa0c18 RCX: 0000000000000000
[ 558.289030] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000092
[ 558.289031] RBP: ffffc900029b3c20 R08: 000000000000001d R09: ffffffff821ee601
[ 558.289031] R10: 000000000000141d R11: 0000000000000000 R12: ffffffff81566590
[ 558.289032] R13: ffffffff81566870 R14: ffffc900029b3c30 R15: ffff880426aa0e98
[ 558.289032] FS: 00007f63006c3700(0000) GS:ffff88043ec00000(0000) knlGS:0000000000000000
[ 558.289033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 558.289033] CR2: 0000000000000008 CR3: 00000003f257e000 CR4: 00000000003406f0
[ 558.289034] Call Trace:
[ 558.289036] ? alloc_fw_cache_entry+0x60/0x60
[ 558.289037] ? request_firmware_nowait+0x140/0x140
[ 558.289038] dev_cache_fw_image+0x46/0x120
[ 558.289039] ? request_firmware_nowait+0x140/0x140
[ 558.289040] dpm_for_each_dev+0x44/0x70
[ 558.289041] fw_pm_notify+0x164/0x190
[ 558.289043] ? prepare_to_wait_event+0x110/0x110
[ 558.289044] notifier_call_chain+0x49/0x70
[ 558.289046] __blocking_notifier_call_chain+0x4d/0x70
[ 558.289047] __pm_notifier_call_chain+0x1f/0x40
[ 558.289047] pm_suspend+0x27f/0x3a0
[ 558.289048] state_store+0x80/0xf0
[ 558.289050] kobj_attr_store+0xf/0x20
[ 558.289051] sysfs_kf_write+0x3a/0x50
[ 558.289053] kernfs_fop_write+0xff/0x180
[ 558.289054] __vfs_write+0x28/0x120
[ 558.289056] ? apparmor_file_permission+0x1a/0x20

So I checked these functions and found gpu_info errors.

The bug is random and cannot be reproduced consistently, but we expect
it to pass more than 30 cycles of S3 suspend and resume.

Any ideas?

Thanks,
Ray