[Advice Request] Trying to debug amdgpu fatal error

Christian,

Thanks for the response. That got me in the right direction.
After some trial and error I found the cause: the Thunderbolt Boot Support
option must be disabled in the BIOS.
With it disabled I can boot into Ubuntu and amdgpu appears to initialize
fine. With it enabled and no other changes, init fails.

The last issue was my own: I had forgotten to use DRI_PRIME and xrandr
correctly.
Happy to say the Red Devil is working now in eGPU mode!
It's about a 20% performance loss compared with a PCI-E slot, right in line
with our previous tests.
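
For anyone repeating this, here is a minimal sketch of the DRI_PRIME part in
Python. The helper name prime_offload_env is made up for illustration, and the
assumption that index "1" is the eGPU may not hold on other systems;
`xrandr --listproviders` shows the actual provider list.

```python
import os


def prime_offload_env(gpu_index="1", base_env=None):
    """Copy an environment and set DRI_PRIME so Mesa renders on another GPU.

    gpu_index="1" is an assumption: it selects the first non-default GPU,
    which is the eGPU only when the iGPU is the primary.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DRI_PRIME"] = str(gpu_index)
    return env


# Build an environment for a child process without touching our own.
env = prime_offload_env(base_env={"DISPLAY": ":0"})
print(env["DRI_PRIME"])
```

The returned dictionary can be passed to subprocess.run(..., env=env), for
example with `glxinfo -B`, to check which renderer an application would get.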

As always thank you for your continued time and support.
We'll be happy to give a shout out to you guys for the help at
article/video time.


Respectfully,

Daniel S. Moran (garwynn)
PC Hardware Editor - XDA-Developers
Phone: 1-559-316-0760/+81-90-5484-4155
Article Links: http://www.xda-developers.com/author/garwynn
E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn

On Mon, Apr 9, 2018 at 10:48 PM, Christian König <christian.koenig at amd.com>
wrote:

> Hi Daniel,
>
> your problem is that the system BIOS is buggy and doesn't assign resources
> to the card:
>
>     Region 0: Memory at <ignored> (64-bit, prefetchable)
>     Region 2: Memory at <ignored> (64-bit, prefetchable)
>     Region 4: I/O ports at 9000 [size=256]
>     Region 5: Memory at <ignored> (32-bit, non-prefetchable)
>     Expansion ROM at <ignored> [disabled]
>
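
A minimal sketch of how the same check could be scripted: it scans
`lspci -vv` text for regions reported as <ignored>, which is the symptom
quoted above. The function name and regular expression are illustrative
assumptions, and the sample is the excerpt from this message, not a full
lspci dump.

```python
import re

# Matches region lines from `lspci -vv`, for example:
#   "Region 0: Memory at <ignored> (64-bit, prefetchable)"
REGION_RE = re.compile(r"^\s*Region (\d+): (Memory|I/O ports) at (\S+)")


def unassigned_regions(lspci_text):
    """Return the region numbers whose address is shown as <ignored>."""
    missing = []
    for line in lspci_text.splitlines():
        m = REGION_RE.match(line)
        if m and m.group(3) == "<ignored>":
            missing.append(int(m.group(1)))
    return missing


SAMPLE = """\
    Region 0: Memory at <ignored> (64-bit, prefetchable)
    Region 2: Memory at <ignored> (64-bit, prefetchable)
    Region 4: I/O ports at 9000 [size=256]
    Region 5: Memory at <ignored> (32-bit, non-prefetchable)
    Expansion ROM at <ignored> [disabled]
"""

print(unassigned_regions(SAMPLE))  # → [0, 2, 5]
```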
>
> The kernel actually tries to assign resources to the bridges, but fails as
> well because the BIOS didn't reserve any during startup.
>
> [    0.179743] pci 0000:12:00.0: can't claim BAR 14 [mem
> 0x01c00000-0xef0fffff]: no compatible bridge window
> [    0.179745] pci 0000:12:00.0: [mem 0x01c00000-0xef0fffff] clipped to
> [mem 0xef000000-0xef0fffff]
> [    0.179747] pci 0000:12:00.0:   bridge window [mem
> 0xef000000-0xef0fffff]
> [    0.179751] pci 0000:13:01.0: can't claim BAR 14 [mem
> 0x01c00000-0x01ffffff]: no compatible bridge window
> [    0.179753] pci 0000:14:00.0: can't claim BAR 14 [mem
> 0x01c00000-0x01ffffff]: no compatible bridge window
> [    0.179754] pci 0000:15:00.0: can't claim BAR 14 [mem
> 0x01d00000-0x01dfffff]: no compatible bridge window
> [    0.179756] pci 0000:08:04.0: can't claim BAR 13 [io  0xb000-0xcfff]:
> address conflict with PCI Bus 0000:12 [io  0x9000-0xbfff]
> [    0.179782] pci 0000:14:00.0: can't claim BAR 0 [mem
> 0x01c00000-0x01c03fff]: no compatible bridge window
> [    0.179789] pci 0000:16:00.0: can't claim BAR 0 [mem
> 0xd0000000-0xdfffffff 64bit pref]: no compatible bridge window
> [    0.179791] pci 0000:16:00.0: can't claim BAR 2 [mem
> 0xe0200000-0xe03fffff 64bit pref]: no compatible bridge window
> [    0.179793] pci 0000:16:00.0: can't claim BAR 5 [mem
> 0x01d00000-0x01d7ffff]: no compatible bridge window
> [    0.179798] pci 0000:16:00.1: can't claim BAR 0 [mem
> 0x01da0000-0x01da3fff]: no compatible bridge window
>
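
A minimal sketch for pulling these failures out of a longer dmesg, grouped
per PCI device. The regular expression is an assumption based only on the
lines quoted above, and it expects each kernel message on a single line
(the mail client wrapped them here).

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the excerpt above:
#   pci 0000:16:00.0: can't claim BAR 0 [mem ...]: no compatible bridge window
CLAIM_RE = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't claim BAR (?P<bar>\d+) \[(?P<res>[^\]]+)\]: (?P<reason>.+)$"
)


def failed_claims(dmesg_text):
    """Map each PCI address to a list of (BAR index, reason) failures."""
    failures = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = CLAIM_RE.search(line)
        if m:
            failures[m.group("dev")].append(
                (int(m.group("bar")), m.group("reason"))
            )
    return dict(failures)


SAMPLE = (
    "[    0.179743] pci 0000:12:00.0: can't claim BAR 14 "
    "[mem 0x01c00000-0xef0fffff]: no compatible bridge window\n"
    "[    0.179756] pci 0000:08:04.0: can't claim BAR 13 [io  0xb000-0xcfff]: "
    "address conflict with PCI Bus 0000:12 [io  0x9000-0xbfff]\n"
    "[    0.179789] pci 0000:16:00.0: can't claim BAR 0 "
    "[mem 0xd0000000-0xdfffffff 64bit pref]: no compatible bridge window\n"
    "[    0.179793] pci 0000:16:00.0: can't claim BAR 5 "
    "[mem 0x01d00000-0x01d7ffff]: no compatible bridge window\n"
)

for dev, items in failed_claims(SAMPLE).items():
    print(dev, items)
```

Feeding it the output of `dmesg` and looking up 0000:16:00.0 shows at a
glance which of the GPU's BARs were left without a bridge window.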
>
> There isn't much you can do except try updating the BIOS and, if that
> doesn't help, replace your motherboard.
>
> Regards,
> Christian.
>
>
> On 09.04.2018 at 15:33, Daniel Moran wrote:
>
> Christian,
> Andrey,
>
> Thank you for the responses.
> Here's the requested dmesg/lspci. Also pulled journalctl just in case but
> didn't see anything that stands out.
>
> I'll take another look at the BIOS settings to see if anything else may
> explain the memory error.
> I've got 16GB in the system at the moment, can bump up to 32 - also added
> a larger swap just in case that was the issue. (No change.)
>
> As always thank you for your continued time and support.
>
> Respectfully,
>
> Daniel S. Moran (garwynn)
> PC Hardware Editor - XDA-Developers
> Phone: 1-559-316-0760/+81-90-5484-4155
> Article Links: http://www.xda-developers.com/author/garwynn
> E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>
> On Mon, Apr 9, 2018 at 3:52 PM, Christian König <christian.koenig at amd.com>
> wrote:
>
>> Please provide the full dmesg of the system as well as the output of
>> "lspci -s 0000:16:00.0 -vvvv" as attachment.
>>
>> Thanks,
>> Christian.
>>
>> On 09.04.2018 at 06:00, Andrey Grodzovsky wrote:
>>
>> Just from a quick look it seems to fail in amdgpu_device_init->ioremap
>> with ENOMEM, which would explain why you don't see any more prints: this
>> failure is very early in the device init process.
>>
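
A minimal sketch showing how the "-12" in the probe message maps to ENOMEM.
It uses Python's errno table, which reflects the host's userspace errno
values; assuming these match the kernel's numbering holds on Linux for
common codes like this one. The function name is illustrative.

```python
import errno
import os


def describe_probe_error(code):
    """Translate a negative kernel return code such as -12 to its errno name."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{name}: {os.strerror(number)}"


# "probe of 0000:16:00.0 failed with error -12"
print(describe_probe_error(-12))
```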
>> No idea why ioremap would fail in this case and not even sure which
>> implementation of ioremap to look into for your case.
>>
>> Adding Christian for this.
>>
>> Andrey
>>
>> On 04/07/2018 03:16 AM, Daniel Moran wrote:
>>
>> Also, to clarify: if I move the card into a regular slot and turn off
>> the eGPU, it works as expected.
>> Tested with Intel iGPU enabled and disabled, made sure i915 loaded
>> without error and can connect display to it.
>>
>>
>>
>> Again, thank you in advance for any time/support offered.
>>
>> Respectfully,
>>
>> Daniel S. Moran (garwynn)
>> PC Hardware Editor - XDA-Developers
>> Phone: 1-559-316-0760/+81-90-5484-4155
>> Article Links: http://www.xda-developers.com/author/garwynn
>> E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>
>> On Sat, Apr 7, 2018 at 3:58 PM, Daniel Moran <xdagarwynn at gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I've got a Powercolor Red Devil Vega 56 here that I'm trying to get
>>> working in eGPU mode.
>>> I think the BIOS/hardware side is now all sorted out.
>>> Now I'm at a point where amdgpu tries to init and hits a fatal error.
>>>
>>> Setting loglevel=8 doesn't produce any additional messages.
>>> Here's what it does report (full dmesg attached):
>>>
>>> [  429.005909] [drm] amdgpu kernel modesetting enabled.
>>> [  429.006080] [drm] initializing kernel modesetting (VEGA10
>>> 0x1002:0x687F 0x148C:0x2388 0xC3).
>>> [  429.006082] amdgpu 0000:16:00.0: Fatal error during GPU init
>>> [  429.006155] amdgpu: probe of 0000:16:00.0 failed with error -12
>>>
>>> I'm using the following commands to unload and reload the module for
>>> testing. Since the card is running as an eGPU, the i7-7700K iGPU (i915
>>> module) is the primary, so these commands work in a terminal without
>>> requiring a reboot.
>>>
>>> sudo rmmod amdgpu
>>> sudo modprobe -v amdgpu
>>>
>>> I pulled UMR and tried to build it, but it fails at the CMake step.
>>> I'll attach the log as a text file.
>>> I'll also attach a full dmesg and lspci dump. uname -a below:
>>> Linux testbox 4.15.15-041515-generic #201803311331 SMP Sat Mar 31
>>> 17:34:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Any other ideas on how I can debug this further? I feel I'm so close
>>> and don't want to let this go.
>>> Thank you in advance for your time.
>>>
>>> Respectfully,
>>>
>>> Daniel S. Moran (garwynn)
>>> PC Hardware Editor - XDA-Developers
>>> Phone: 1-559-316-0760/+81-90-5484-4155
>>> Article Links: http://www.xda-developers.com/author/garwynn
>>> E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>>
>
>