Re: [Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.

What I want to know is what is calling your machine ‘localhorst’? 

Sent from my iPhone

> On Nov 20, 2018, at 9:15 AM, bugzilla-daemon@xxxxxxxxxxxxxxx wrote:
> 
> Comment # 47 on bug 105733 from Allan
> I have really bad news.
> 
> It took me a long time to answer because I literally sent in for warranty or
> replaced ALL of the components in the PC.
> 
> The CPU (R7 1800X) was replaced by AMD itself: the old one was from batch 21,
> the new one is from batch 35.
> 
> But OK, let's talk about amdgpu:
> 
> (In reply to Andrey Grodzovsky from comment #25)
> > (In reply to Allan from comment #12)
> > Can you build latest kernel (4.18) and grab again latest firmware and try
> > again ?
> > Links to kernel and firmware:
> > https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ 
> 
> For reasons already explained here I could neither compile nor test it before,
> so please don't be mad at me:
> - Sold my old PC.
> - My notebook was completely filled with files.
> - Components on warranty. Testing everything else.
> 
> So I managed to borrow a PC to test the video cards. So far I have tested only
> the nvidia one, to prove to AMD that the GPU works and that it is the PCI
> controller of the CPU/chipset (a guess of mine) that is broken. I am going to
> test the RX480 on this PC as soon as possible. My warranties are expiring and I
> had to set priorities.
> 
> I already said this here, but with the 1800X I couldn't even clone the git
> repository (the checksum always failed; I tried many times).
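A checksum that fails on every clone can come from the network or the disk as well as from the CPU or RAM. One way to narrow it down is to hash a single in-memory buffer repeatedly and see whether the digests ever disagree. The following is only a minimal sketch, assuming Python 3 is available on the affected machine; `digests_agree` is a hypothetical helper name, and a clean result does not prove the hardware is healthy.

```python
import hashlib
import os


def digests_agree(data: bytes, rounds: int = 20) -> bool:
    """Hash the same buffer `rounds` times; True if every digest is identical."""
    digests = {hashlib.sha1(data).hexdigest() for _ in range(rounds)}
    return len(digests) == 1


if __name__ == "__main__":
    # 16 MiB of random data; the size is an arbitrary choice for illustration.
    payload = os.urandom(16 * 1024 * 1024)
    print("consistent" if digests_agree(payload) else "MISMATCH: suspect CPU/RAM")
```

If this ever prints a mismatch, the network and the git server are ruled out as the cause, since no I/O is involved after the buffer is created.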
> 
> Then I managed to free some space on my notebook and started to build
> yesterday.
> - Included amd-ucode firmware.
> - Included polaris10 firmware (for RX480).
> - Made some optimizations for Ryzen as described on Gentoo's dedicated
> page.
> 
> Compiled version 4.20-rc1, as present in the branch. No errors reported.
> 
> There are 2 main applications that are easiest to test with right now to find
> the problems:
> - Metro 2033 Redux through Steam.
> - Left 4 Dead 2 through Steam.
> 
> Started Metro 2033. It worked for some minutes with no issue but, for some
> reason, without any sound. Closed it. Turned off the HDMI audio in pavucontrol
> to use only the default output. Restarted Steam.
> 
> Started Left 4 Dead 2 this time. Was able to change the graphics settings to
> max without AA and vsync. Played for 15 seconds and got a screen freeze. Waited
> for a script to properly record the logs and temps. Hard rebooted. This time
> even my BIOS/EFI screen had a green background, but it was still operational.
> Everything was green except the text. Rebooted again, got back to normal colors.
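The recording script itself is not shown in the report. As a rough idea of what the temperature side of such a script might look like, here is a sketch that reads the kernel's hwmon sysfs tree (where drivers such as k10temp and amdgpu expose their sensors). The function name `read_hwmon_temps` is hypothetical, and two chips with the same name would overwrite each other's entries in this simplified version.

```python
from pathlib import Path


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> dict:
    """Return {"<chip name>/<sensor file>": degrees Celsius} for readable sensors."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        if not name_file.is_file():
            continue
        chip_name = name_file.read_text().strip()
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[f"{chip_name}/{sensor.name}"] = int(sensor.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor vanished or returned garbage; skip it
    return temps
```

Calling this in a loop and appending the result to a file alongside a timestamp would give a temperature trail to compare against the time of the freeze.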
> 
> And here are the logs :
> 
> kern.log about Firefox usage :
> > Nov 14 05:26:50 desk kernel: [  324.714998] Chrome_~dThread[1788]: segfault at 0 ip 00007fbfee5e3181 sp 00007fbfec2d1ad0 error 6 in libxul.so[7fbfee5cf000+3a2c000]
> 
> This suggests that the CPU either still has problematic microcode or is
> defective.
> 
> dmesg about amdgpu screen freeze :
> > [ 3323.920795] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c for process hl2_linux pid 14648 thread amdgpu_cs:0 pid 14653
> > [ 3323.920799] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
> > [ 3323.920801] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
> > [ 3323.920804] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 1, pasid 32774) at page 0, read from 'TC0' (0x54433000) (8)
> > [ 3334.103233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=274140, emitted seq=274142
> > [ 3334.103239] amdgpu 0000:09:00.0: GPU reset begin!
> > [ 3344.332607] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:46:crtc-0] hw_done or flip_done timed out
> > [ 3504.834097] INFO: task kworker/u32:2:3872 blocked for more than 120 seconds.
> > [ 3504.834103]       Not tainted 4.20.0-rc1-amd #2
> > [ 3504.834105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 3504.834107] kworker/u32:2   D    0  3872      2 0x80000000
> > [ 3504.834123] Workqueue: events_unbound commit_work [drm_kms_helper]
> > [ 3504.834126] Call Trace:
> > [ 3504.834133]  ? __schedule+0x2a0/0x880
> > [ 3504.834136]  schedule+0x28/0x80
> > [ 3504.834139]  schedule_timeout+0x25d/0x380
> > [ 3504.834217]  ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
> > [ 3504.834292]  ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
> > [ 3504.834297]  dma_fence_default_wait+0x23b/0x2a0
> > [ 3504.834301]  ? dma_fence_release+0x90/0x90
> > [ 3504.834304]  dma_fence_wait_timeout+0xdd/0x100
> > [ 3504.834308]  reservation_object_wait_timeout_rcu+0x161/0x270
> > [ 3504.834387]  amdgpu_dm_do_flip+0x112/0x370 [amdgpu]
> > [ 3504.834468]  amdgpu_dm_atomic_commit_tail+0x68b/0xcd0 [amdgpu]
> > [ 3504.834472]  ? __switch_to_asm+0x40/0x70
> > [ 3504.834475]  ? wait_for_completion_timeout+0x3b/0x1a0
> > [ 3504.834477]  ? __switch_to_asm+0x34/0x70
> > [ 3504.834480]  ? __switch_to_asm+0x40/0x70
> > [ 3504.834483]  ? __switch_to+0x1ba/0x450
> > [ 3504.834492]  commit_tail+0x3d/0x70 [drm_kms_helper]
> > [ 3504.834497]  process_one_work+0x1aa/0x3a0
> > [ 3504.834500]  worker_thread+0x30/0x3a0
> > [ 3504.834503]  ? drain_workqueue+0x130/0x130
> > [ 3504.834506]  kthread+0x11d/0x140
> > [ 3504.834509]  ? kthread_park+0x80/0x80
> > [ 3504.834512]  ret_from_fork+0x22/0x40
> > [ 3516.645267] WARNING: CPU: 14 PID: 14694 at kernel/kthread.c:501 kthread_park+0x6c/0x80
> > [ 3516.645271] Modules linked in: fuse edac_mce_amd kvm_amd nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec joydev amdgpu snd_hda_core snd_hwdep chash gpu_sched snd_pcm snd_timer ttm drm_kms_helper snd drm i2c_algo_bit sp5100_tco soundcore kvm efi_pstore efivars sg irqbypass evdev wmi_bmof serio_raw pcspkr k10temp ccp tpm_crb pcc_cpufreq tpm_tis tpm_tis_core tpm rng_core acpi_cpufreq button parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto btrfs xor zstd_decompress zstd_compress xxhash raid6_pq libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod sd_mod hid_generic usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ahci xhci_pci aes_x86_64 libahci crypto_simd xhci_hcd cryptd glue_helper libata r8169 i2c_piix4 libphy usbcore scsi_mod thermal wmi gpio_amdpt gpio_generic
> > [ 3516.645324] CPU: 14 PID: 14694 Comm: TaskSchedulerFo Not tainted 4.20.0-rc1-amd #2
> > [ 3516.645327] Hardware name: BIOSTAR Group X370GT7/X370GT7, BIOS 5.13 08/07/2018
> > [ 3516.645330] RIP: 0010:kthread_park+0x6c/0x80
> > [ 3516.645333] Code: 18 e8 88 6c 67 00 be 40 00 00 00 48 89 df e8 8b c3 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 80 00 00 00 00 0f 1f
> > [ 3516.645335] RSP: 0018:ffffbafdc3fcfb60 EFLAGS: 00010202
> > [ 3516.645338] RAX: 0000000000000004 RBX: ffff9dcd93f140c0 RCX: dead000000000200
> > [ 3516.645339] RDX: ffff9dcd92ba7430 RSI: ffff9dcd93f140c0 RDI: ffff9dcd8a9049c0
> > [ 3516.645341] RBP: ffff9dcd940a5360 R08: ffff9dcd96da25a8 R09: 0000000000000000
> > [ 3516.645343] R10: 0000000000000000 R11: 000000000000019c R12: ffff9dcd92ba27a0
> > [ 3516.645344] R13: ffff9dcd76d34200 R14: 0000000000000206 R15: dead000000000100
> > [ 3516.645347] FS:  00007efea483e700(0000) GS:ffff9dcd96d80000(0000) knlGS:0000000000000000
> > [ 3516.645349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 3516.645351] CR2: 00005654fe725e10 CR3: 0000000200d40000 CR4: 00000000003406e0
> > [ 3516.645352] Call Trace:
> > [ 3516.645362]  drm_sched_entity_fini+0x37/0x190 [gpu_sched]
> > [ 3516.645423]  amdgpu_vm_fini+0xad/0x530 [amdgpu]
> > [ 3516.645429]  ? idr_destroy+0x78/0xc0
> > [ 3516.645481]  amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
> > [ 3516.645496]  drm_file_free.part.5+0x21f/0x300 [drm]
> > [ 3516.645510]  drm_release+0xaa/0x120 [drm]
> > [ 3516.645514]  __fput+0xac/0x1e0
> > [ 3516.645518]  task_work_run+0x8f/0xb0
> > [ 3516.645522]  do_exit+0x2e6/0xb30
> > [ 3516.645525]  do_group_exit+0x3a/0xb0
> > [ 3516.645528]  get_signal+0x27a/0x5f0
> > [ 3516.645532]  do_signal+0x30/0x6d0
> > [ 3516.645537]  exit_to_usermode_loop+0x89/0xf0
> > [ 3516.645540]  do_syscall_64+0xda/0xe0
> > [ 3516.645544]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 3516.645547] RIP: 0033:0x7efeb6b9d19a
> > [ 3516.645553] Code: Bad RIP value.
> > [ 3516.645555] RSP: 002b:00007efea483d810 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > [ 3516.645557] RAX: fffffffffffffdfc RBX: 00007efea483d958 RCX: 00007efeb6b9d19a
> > [ 3516.645559] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007efea483d980
> > [ 3516.645560] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffe661d7080
> > [ 3516.645562] R10: 00007efea483d860 R11: 0000000000000246 R12: 0000000000000000
> > [ 3516.645564] R13: 00007efea483d980 R14: 00007efea483d990 R15: 00007efea483d930
> > [ 3516.645566] ---[ end trace 7da35ac4aa65c90d ]---
> 
> It is important to note that the most common code that appears while using
> generic kernels is 147, rather than the 146 shown here.
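To put numbers on how often each code shows up across kernels, the "GPU fault detected" lines can be tallied from saved logs. This is a sketch only: the regular expression is modelled on the single line format quoted in the dmesg output above, other kernel versions may print it differently, and `count_fault_codes` is a hypothetical helper name.

```python
import re
from collections import Counter

# Modelled on a line such as:
# amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c for process hl2_linux pid 14648 ...
FAULT_RE = re.compile(
    r"amdgpu (?P<dev>\S+): GPU fault detected: (?P<code>\d+) (?P<raw>0x[0-9a-fA-F]+)"
    r"(?: for process (?P<proc>\S+) pid (?P<pid>\d+))?"
)


def count_fault_codes(lines):
    """Count (fault code, process name) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            counts[(int(m.group("code")), m.group("proc"))] += 1
    return counts
```

For example, `count_fault_codes(open("kern.log", errors="replace"))` would return a counter keyed by code and process, which makes the 146 versus 147 comparison easy to report per kernel.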
> 
> Xorg.0.log reports nothing.
> 
> I said this was bad news because it seems to me that both the CPU and the
> amdgpu driver are defective.
> 
> I noticed that while running kernel 4.18 the GPU is kept at 100% (mclk and
> sclk) all the time, while with this new kernel the GPU tries to scale its
> performance.
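The clock behaviour can be checked from sysfs rather than by eye. amdgpu exposes the available sclk and mclk levels in `pp_dpm_sclk` and `pp_dpm_mclk`, with the active level marked by an asterisk. A minimal sketch for reading them follows; `card0` is an assumption about which DRM node the RX480 gets, and the function names are hypothetical.

```python
import re
from pathlib import Path

# One level per line, e.g. "1: 608Mhz *" where "*" marks the active level.
LEVEL_RE = re.compile(r"^(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text: str):
    """Return (levels, active): levels maps index -> MHz, active is the starred index or None."""
    levels, active = {}, None
    for line in text.splitlines():
        m = LEVEL_RE.match(line.strip())
        if not m:
            continue
        idx, mhz = int(m.group(1)), int(m.group(2))
        levels[idx] = mhz
        if m.group(3):
            active = idx
    return levels, active


def read_dpm(card: str = "card0", clock: str = "sclk"):
    """Read and parse pp_dpm_<clock> for the given DRM card (the path is an assumption)."""
    path = Path(f"/sys/class/drm/{card}/device/pp_dpm_{clock}")
    return parse_dpm_levels(path.read_text())
```

Sampling `read_dpm(clock="sclk")` and `read_dpm(clock="mclk")` once a second on both kernels would show whether the active level really stays at the top index on 4.18 and moves around on the newer build.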
> 
> Also, it is important to note that the nvidia GTX 1070 throws a lot of Xid
> error codes (see
> https://devtalk.nvidia.com/default/topic/1043483/linux/xid-errors-on-gtx-1070-linux/post/5293440
> ). This is why I think that the 1800X has a defective PCI controller, and it
> is also the second part of the "really bad news". Maybe it is happening mostly
> with Ryzen processors? I'll test the RX480 with the other computer ASAP, but
> first I need to send information about the CPU to AMD so they can proceed with
> the warranty process.
> 
> The GTX 1070 works without a single problem outside of this PC. The other cards
> that I had tested before follow the same pattern (2 RX480, 1 RX 580, 1 GTX
> 970, 1 GTX 1070).
> 
> Currently I have only 1 RX480 and 1 GTX 1070. Now that I know that the cards
> don't have any problem I'm selling them, and soon I'll have only one or none.
> The seller told me off for requesting warranty on the RX 480 when I thought it
> was defective. He sent me a different one and, according to him, the one that I
> sent back was working without any issues.
> 
> I'm already at a new stage of re-sending the CPU to AMD, and praying for an end
> to my endless torment. I think that they'll have to refund me (and then I'll
> take a loss on the motherboard).
> 
> Please tell me about any other steps that you would like me to take.
> 
> I can also provide a full description of the kernel compilation (parameters)
> and even provide a link to the generated .deb packages.
> You are receiving this mail because:
> You are the assignee for the bug.
> _______________________________________________
> dri-devel mailing list
> dri-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/dri-devel