[Bug 59761] New: Kernel fails to reset AMD HD5770 GPU properly and encounters OOPS. GPU reset fails - system remains in unusable state.

https://bugzilla.kernel.org/show_bug.cgi?id=59761

           Summary: Kernel fails to reset AMD HD5770 GPU properly and
                    encounters OOPS. GPU reset fails - system remains in
                    unusable state.
           Product: Drivers
           Version: 2.5
    Kernel Version: 3.10 RC5
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: Video(DRI - non Intel)
        AssignedTo: drivers_video-dri@xxxxxxxxxxxxxxxxxxxx
        ReportedBy: t3st3r@xxxxxxx
        Regression: No


Intro:
This is a really tricky bug. The GPU lockup itself is probably provoked by MESA
and is out of scope here.
However, recovering from a GPU lockup is the kernel's job, and that is where the
kernel fails in this case.

Configuration:
 Xubuntu 13.04 64-bit running the 3.10 RC5 Linux kernel. Similar problems occur
with some older kernels as well (the recent GPU reset handling rework does not
seem to help much).
 MESA should be a recent 9.1 or 9.2 git build to provoke the GPU lockup
condition.
 GPU is AMD HD5770, 512 MB GDDR5.
 libtxc-dxtn-s2tc is installed to handle 

To reproduce:
 It is enough to run the Ryzom RPG (www.ryzom.com) with the 128 MB textures
setting.
 I'm using the 64-bit version from the launchpad PPA
(https://launchpad.net/~kervala/+archive/ppa)

Basically it looks like the following:
1) Launch the game and let it run for some time using the best (128 MB)
textures on a GPU like mine.
2) You may notice that textures on some objects are garbled/broken and do not
display properly. Maybe a data transfer error or similar.
   Note: MESA before 9.1 does not have this bug, so it will not occur there.
3) After some run time the GPU encounters a lockup (CP stall). Probably MESA
does something wrong during code generation.
4) The kernel then attempts to reset the GPU, but the reset never works
properly.
5) All graphics output locks up because the GPU driver has failed to reset the
GPU properly.

This condition is quite fatal: the system still responds to alt-sysrq, but it
becomes completely unusable due to the lack of any graphics output.

Expected:
 The GPU is properly reset and the system recovers to a usable state. No kernel
errors should happen during this process.

One of the logs with crash data follows:

Jun 15 04:47:12 compname kernel: [17564.696695] radeon 0000:01:00.0: GPU lockup
CP stall for more than 10000msec
Jun 15 04:47:12 compname kernel: [17564.696706] radeon 0000:01:00.0: GPU lockup
(waiting for 0x00000000004ae9a1 last fence id 0x00000000004ae9a0)
Jun 15 04:47:12 compname kernel: [17564.697787] radeon 0000:01:00.0: Saved 119
dwords of commands on ring 0.
Jun 15 04:47:12 compname kernel: [17564.697812] BUG: unable to handle kernel
paging request at ffffc90012a9c418
Jun 15 04:47:12 compname kernel: [17564.697885] IP: [<ffffffffa03a2ace>]
radeon_fence_process+0x8e/0x160 [radeon]
Jun 15 04:47:12 compname kernel: [17564.697985] PGD 41f00f067 PUD 41f020067 PMD
417586067 PTE 0
Jun 15 04:47:12 compname kernel: [17564.698045] Oops: 0000 [#1] SMP 
Jun 15 04:47:12 compname kernel: [17564.698080] Modules linked in: parport_pc
ppdev bnep rfcomm bluetooth snd_hda_codec_hdmi kvm_amd kvm crc32_pclmul
ghash_clmulni_intel mxm_wmi aesni_intel aes_x86_64 lrw gf128mul glue_helper
ablk_helper cryptd snd_hda_codec_realtek microcode fam15h_power snd_hda_intel
serio_raw snd_ca0106 amd64_edac_mod edac_core snd_ac97_codec edac_mce_amd
snd_hda_codec k10temp ac97_bus snd_hwdep snd_pcm snd_seq_midi radeon
snd_page_alloc joydev sp5100_tco snd_seq_midi_event i2c_piix4 snd_rawmidi
snd_seq snd_seq_device snd_timer ttm drm_kms_helper drm snd i2c_algo_bit
soundcore mac_hid wmi xfs it87 hwmon_vid lp parport btrfs xor zlib_deflate
hid_generic usbhid hid raid6_pq libcrc32c usb_storage firewire_ohci
firewire_core pata_acpi crc_itu_t r8169 ahci pata_atiixp libahci
Jun 15 04:47:12 compname kernel: [17564.698840] CPU: 6 PID: 2925 Comm:
ryzom_client Not tainted 3.10.0-031000rc5-generic #201306082135
Jun 15 04:47:12 compname kernel: [17564.698914] Hardware name: Gigabyte
Technology Co., Ltd. 
Jun 15 04:47:12 compname kernel: [17564.698991] task: ffff880414908000 ti:
ffff880403afc000 task.ti: ffff880403afc000
Jun 15 04:47:12 compname kernel: [17564.699053] RIP: 0010:[<ffffffffa03a2ace>] 
[<ffffffffa03a2ace>] radeon_fence_process+0x8e/0x160 [radeon]
Jun 15 04:47:12 compname kernel: [17564.699160] RSP: 0018:ffff880403afdc18 
EFLAGS: 00010246
Jun 15 04:47:12 compname kernel: [17564.699205] RAX: ffffc90012a9c418 RBX:
0000000000000002 RCX: ffff880415134dc0
Jun 15 04:47:12 compname kernel: [17564.699264] RDX: 0000000000000041 RSI:
0000000000000000 RDI: ffff880415134000
Jun 15 04:47:12 compname kernel: [17564.699323] RBP: ffff880403afdc78 R08:
ffffffff00000000 R09: ffff880415134208
Jun 15 04:47:12 compname kernel: [17564.699382] R10: 0000000000000000 R11:
0000000000000005 R12: 000000000000000c
Jun 15 04:47:12 compname kernel: [17564.699441] R13: ffff880415134e08 R14:
0000000000000002 R15: ffff880415134000
Jun 15 04:47:12 compname kernel: [17564.699501] FS:  00007f6231639780(0000)
GS:ffff88042fd80000(0000) knlGS:0000000000000000
Jun 15 04:47:12 compname kernel: [17564.699567] CS:  0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Jun 15 04:47:12 compname kernel: [17564.699615] CR2: ffffc90012a9c418 CR3:
00000003d049d000 CR4: 00000000000407e0
Jun 15 04:47:12 compname kernel: [17564.699674] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Jun 15 04:47:12 compname kernel: [17564.699734] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Jun 15 04:47:12 compname kernel: [17564.699792] Stack:
Jun 15 04:47:12 compname kernel: [17564.699811]  ffff880417a9c848
ffffffffa042b3d0 ffff88041918b8a0 ffff880403afdc98
Jun 15 04:47:12 compname kernel: [17564.699886]  ffff880403afdc68
ffffffff8143357f 0000000000000001 ffff880415134000
Jun 15 04:47:12 compname kernel: [17564.699960]  0000000000000005
ffff880415134000 ffff880415134e38 ffff8804151345f8
Jun 15 04:47:12 compname kernel: [17564.700034] Call Trace:
Jun 15 04:47:12 compname kernel: [17564.700067]  [<ffffffff8143357f>] ?
__dev_printk+0x5f/0xa0
Jun 15 04:47:12 compname kernel: [17564.700141]  [<ffffffffa03a38c3>]
radeon_fence_count_emitted+0x23/0x70 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700234]  [<ffffffffa03b9fcb>]
radeon_ring_backup+0x4b/0x130 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700314]  [<ffffffffa038e560>]
radeon_gpu_reset+0x90/0x220 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700402]  [<ffffffffa03b8d36>]
radeon_gem_wait_idle_ioctl+0xd6/0x100 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700486]  [<ffffffffa02c658a>]
drm_ioctl+0x50a/0x650 [drm]
Jun 15 04:47:12 compname kernel: [17564.700568]  [<ffffffffa03b8c60>] ?
radeon_gem_busy_ioctl+0x120/0x120 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700632]  [<ffffffff81082401>] ?
update_curr+0x141/0x1f0
Jun 15 04:47:12 compname kernel: [17564.700684]  [<ffffffff810810dd>] ?
set_next_entity+0xad/0xd0
Jun 15 04:47:12 compname kernel: [17564.700738]  [<ffffffff811987c7>]
do_vfs_ioctl+0x87/0x330
Jun 15 04:47:12 compname kernel: [17564.700787]  [<ffffffff816cab14>] ?
__schedule+0x3d4/0x6b0
Jun 15 04:47:12 compname kernel: [17564.700837]  [<ffffffff81198b01>]
SyS_ioctl+0x91/0xb0
Jun 15 04:47:12 compname kernel: [17564.700885]  [<ffffffff816d5506>]
system_call_fastpath+0x1a/0x1f
Jun 15 04:47:12 compname kernel: [17564.700936] Code: 49 87 55 00 48 39 d0 73
50 48 89 c3 41 ba 01 00 00 00 41 80 bf a0 16 00 00 00 4d 8b b1 f8 0b 00 00 0f
84 8a 00 00 00 48 8b 41 10 <8b> 00 48 89 da 89 c0 4c 21 c2 48 09 d0 48 39 c3 76
0c 4c 89 f2 
Jun 15 04:47:12 compname kernel: [17564.701288] RIP  [<ffffffffa03a2ace>]
radeon_fence_process+0x8e/0x160 [radeon]
Jun 15 04:47:12 compname kernel: [17564.701375]  RSP <ffff880403afdc18>
Jun 15 04:47:12 compname kernel: [17564.701406] CR2: ffffc90012a9c418
Jun 15 04:47:12 compname kernel: [17564.735862] ---[ end trace 5017208705d52fa8
]---
Jun 15 04:47:17 compname kernel: [17570.145204] SysRq : Emergency Sync
Jun 15 04:47:17 compname kernel: [17570.153481] Emergency Sync complete
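
For anyone who wants to pull the key facts out of a log like the one above,
here is a minimal Python sketch. It is not part of the original report: the
names summarize and LOG are made up for illustration, and the sample lines are
copied from the log above with the mail line wrapping undone and the syslog
prefix dropped. It extracts the two fence ids from the lockup message, the
faulting address, the registers holding that address, and the call chain
(frames the kernel prints with '?' are skipped as unreliable).

```python
import re

# Sample lines copied from the log above (mail wrapping undone, syslog
# prefix dropped). LOG and summarize() are illustrative names only.
LOG = """\
[17564.696706] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000004ae9a1 last fence id 0x00000000004ae9a0)
[17564.697812] BUG: unable to handle kernel paging request at ffffc90012a9c418
[17564.697885] IP: [<ffffffffa03a2ace>] radeon_fence_process+0x8e/0x160 [radeon]
[17564.699205] RAX: ffffc90012a9c418 RBX: 0000000000000002 RCX: ffff880415134dc0
[17564.699615] CR2: ffffc90012a9c418 CR3: 00000003d049d000 CR4: 00000000000407e0
[17564.700067]  [<ffffffff8143357f>] ? __dev_printk+0x5f/0xa0
[17564.700141]  [<ffffffffa03a38c3>] radeon_fence_count_emitted+0x23/0x70 [radeon]
[17564.700234]  [<ffffffffa03b9fcb>] radeon_ring_backup+0x4b/0x130 [radeon]
[17564.700314]  [<ffffffffa038e560>] radeon_gpu_reset+0x90/0x220 [radeon]
[17564.700402]  [<ffffffffa03b8d36>] radeon_gem_wait_idle_ioctl+0xd6/0x100 [radeon]
"""


def summarize(log):
    """Extract fence ids, faulting address and call chain from oops text."""
    fence = re.search(
        r"waiting for (0x[0-9a-f]+) last fence id (0x[0-9a-f]+)", log)
    fault = re.search(r"paging request at ([0-9a-f]+)", log)
    # Register dump entries look like "RAX: ffffc90012a9c418".
    regs = dict(re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", log))
    # Stack frames look like "[<addr>] symbol+0x..."; a leading "? " marks
    # a frame the kernel could not verify, so those are dropped.
    frames = re.findall(r"\[<[0-9a-f]+>\] (\? )?(\w+)\+0x", log)

    waiting, last = int(fence.group(1), 16), int(fence.group(2), 16)
    fault_addr = fault.group(1)
    return {
        "pending_fences": waiting - last,
        "fault_addr": fault_addr,
        "regs_holding_fault_addr": sorted(
            name for name, value in regs.items() if value == fault_addr),
        "call_chain": [sym for unreliable, sym in frames if not unreliable],
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(key, value)
```

On the sample lines this reports one pending fence (0x4ae9a1 against 0x4ae9a0),
the faulting address ffffc90012a9c418 held in both RAX and CR2, and the chain
radeon_gem_wait_idle_ioctl -> radeon_gpu_reset -> radeon_ring_backup ->
radeon_fence_count_emitted -> radeon_fence_process. In other words, the oops
happens inside the reset path itself, while radeon_ring_backup is running.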

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel



